Scheduling and Resource Management for Next-generation Clusters
Yanyong Zhang, Penn State University
www.cse.psu.edu/~yyzhang

What is a Cluster?
• Cost effective
• Easily scalable
• Highly available
• Readily upgradeable

Scientific & Engineering Applications
• HPTi won a 5-year $15M procurement to provide systems for weather modeling (NOAA). (http://www.noaanews.noaa.gov/stories/s419.htm)
• Sandia's expansion of their Alpha-based C-plant system.
• Maui HPCC LosLobos Linux super-cluster. (http://www.dl.ac.uk/CFS/benchmarks/beowulf/tsld007.htm)
• "A performance-price ratio of … is demonstrated in simulations of wind instruments using a cluster of 20 …" (http://www.swiss.ai.mit.edu/~pas/p/sc95.html)
• "The PC cluster based parallel simulation environment and the technologies … will have a positive impact on networking research nationwide …" (http://www.osc.edu/press/releases/2001/approved.shtml)

Commercial Applications
• Business applications
– Transaction processing (IBM DB2, Oracle, …)
– Decision support systems (IBM DB2, Oracle, …)
• Internet applications
– Web serving / searching (Google.com, …)
– Infowares (Yahoo.com, AOL.com)
– Email, eChat, ePhone, eBook, eBank, eSociety, eAnything
– Computing portals

Resource Management
• Each application is demanding
• Several applications/users can be present at the same time
⇒ Resource management and quality of service become important.
System Model
[Figure: an arrival queue feeding independent nodes P0–P4 over a high-speed network]
• Each node is independent
• Maximum MPL (multiprogramming level)
• Arrival queue

Two Phases in Resource Management
• Allocation issues
– Admission control
– Arrival queue principle
• Scheduling issues (CPU scheduling)
– Resource isolation
– Co-allocation

Co-allocation / Coscheduling
[Figure: P0 sends at t0, P1 receives at t1; the scheduling skew between the two forces a context switch]

Outline
• From the OS's perspective
– Contribution 1: boosting the CPU utilization at supercomputing centers (NEXT)
– Contribution 2: providing quick responses for commercial workloads
– Contribution 3: scheduling multiple classes of applications
• From the application's perspective
– Contribution 4: optimizing clustered DB2

Contribution 1: Boosting CPU Utilization at Supercomputing Centers

Objective
• Response time = wait time + execute time, where wait time covers both the wait in the arrival queue and the wait in the ready/blocked queues
• slowdown = Response Time / Execute Time in Isolation
• Goal: minimize slowdown

Existing Techniques
• Backfilling (BF)
• Gang Scheduling (GS)
• Migration (M)
[Figure: space-time charts packing jobs of width 8, 2, 3, 6, 2 onto 14 CPUs under each technique]

Proposed Scheme
• MBGS = GS + BF + M
– Use GS as the basic framework
– At each row of the GS matrix, apply the BF technique
– Whenever the GS matrix is recomputed, consider migration

How Does MBGS Perform?
[Figure: performance comparison]

Outline
• From the OS's perspective
– Contribution 1: boosting the CPU utilization at supercomputing centers
– Contribution 2: providing quick responses for commercial workloads (NEXT)
– Contribution 3: scheduling multiple classes of applications
• From the application's perspective
– Contribution 4: optimizing clustered DB2

Contribution 2: Reducing Response Times for Commercial Applications

Objective
• Response time = wait time (arrival queue + ready/blocked queues) + execute time
• Minimize wait time
• Minimize response time

Previous Work I: Gang Scheduling (GS)
[Figures: (1) response times stretch to minutes; (2) CPU capacity is wasted]
• GS is not responsive enough!
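The backfilling step that MBGS applies within each GS row can be sketched as follows. This is a minimal, conservative sketch (it only backfills jobs guaranteed to finish before the head job's reservation), assuming jobs are (id, width, runtime) tuples; all names are illustrative, not from the talk.

```python
# Conservative backfilling sketch: jobs from the back of the queue may
# start early only if they fit in the free CPUs AND are guaranteed to
# finish before the reserved start time of the job at the queue head.

def backfill(queue, free_cpus, now, reservation_start):
    started = []
    for job in list(queue):            # iterate over a copy while removing
        jid, width, runtime = job
        fits_now = width <= free_cpus
        finishes_in_time = now + runtime <= reservation_start
        if fits_now and finishes_in_time:
            queue.remove(job)
            free_cpus -= width
            started.append(jid)
    return started, free_cpus
```

For example, with 6 free CPUs and a reservation at time 3, a 2-CPU/1-unit job can be backfilled, while a 4-CPU/5-unit job cannot (it would delay the reservation).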
Previous Work II: Dynamic Co-scheduling
[Figure: four nodes P0–P3; B just gets a message, C just finishes I/O, it's A's turn, everybody else is blocked]
• The scheduler on each node makes independent decisions based on local events, without global synchronization.

Dynamic Co-scheduling Heuristics
(columns: how do you wait for a message? rows: what do you do on message arrival?)

                          Busy Wait   Spin Block   Spin Yield
No Explicit Reschedule    Local       SB           SY
Interrupt & Reschedule    DCS         DCS-SB       DCS-SY
Periodically Reschedule   PB          PB-SB        PB-SY

Simulation Study
• A detailed simulator at microsecond granularity
• System parameters
– System configuration (maximum MPL, to partition or not)
– System overheads (context-switch overheads, interrupt costs, costs of manipulating queues)

Simulation Study (Cont'd)
• Application parameters
– Injection load
– Characteristics (CPU intensive, I/O intensive, communication intensive, or somewhere in the middle)

Impact of Load
[Figure: response time versus injection load]

Impact of Workload Characteristics
[Figure: average job response time (×10,000 seconds) for communication-intensive and I/O-intensive workloads]

Periodic Boost Heuristics
• S1: Compute phase
• S2: S1 + unconsumed message
• S3: Receive + message arrived
• S4: Receive + no message
[Figure: response times of five boost orderings]
• A: S3 -> {S2, S1}
• B: S3 -> S2 -> S1
• C: {S3, S2, S1}
• D: {S3, S2} -> S1
• E: S2 -> S3 -> S1

Analytical Modeling Study
• The full state space is impossible to handle.
[Figure: nodes P0 … Pp connected by a high-speed network, with dynamic job arrivals]

Analysis Description
• The original state space tracks, for every one of the P nodes, the number of jobs on that node and the detailed state of each job (computing, receiving, blocked, queued), so its size explodes with the number of nodes — impossible to handle.
• Assumption: the state of each processor is stochastically independent and identical to the state of the other processors.
• Under this assumption, the state of a single representative processor suffices, giving a reduced state space that is much more tractable.
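The spin-block (SB) wait heuristic from the table above — spin for a bounded time hoping the message arrives while the sender is still scheduled, then block and yield the CPU — can be sketched at user level. The queue-plus-event "message channel" and all names here are assumptions for illustration, not the actual kernel mechanism.

```python
import time
import threading

# Spin-block sketch: spin on the message queue until a spin budget
# expires; only then block on the event (i.e., release the CPU).

def spin_block_recv(msg_queue, event, spin_limit_s=0.0005):
    deadline = time.monotonic() + spin_limit_s
    while time.monotonic() < deadline:     # spin phase
        if msg_queue:
            return msg_queue.pop(0), "spin"
    event.wait()                           # block phase: yield the CPU
    return msg_queue.pop(0), "block"
```

If the message is already there, the spin phase wins and no context switch is paid; if the sender is slow, the receiver blocks instead of burning its quantum.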
Analysis Description (Cont'd)
• Derive the state transition rates using a continuous-time Markov model
• Build the generator matrix Q
• Obtain the invariant probability vector π by solving πQ = 0 and πe = 1
• Use fixed-point iteration to get the solution

SB Example
[Figure: state-transition diagram for the spin-block model, with the transition rates r1 and r2 expressed in terms of the state probabilities]

Results
[Figures: optimal PB frequency; optimal spin time for SB]

Results – Optimal Quantum Length
[Figure: optimal quantum length for communication-intensive, CPU-intensive, and I/O-intensive workloads]

Outline
• From the OS's perspective
– Contribution 1: boosting the CPU utilization at supercomputing centers
– Contribution 2: providing quick responses for commercial workloads
– Contribution 3: scheduling multiple classes of applications (NEXT)
• From the application's perspective
– Contribution 4: optimizing clustered DB2

Contribution 3: Scheduling Multiple Classes of Applications
• Interactive, batch, and real-time applications share the cluster

Objective
• BE (best effort): how long did it take me to finish? → response time
• RT (real time): how many deadlines have been missed? → miss rate
• Fairness ratio (x:y): RT gets x/(x+y) of the cluster resource, BE gets y/(x+y)

How to Adhere to the Fairness Ratio?
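The invariant-vector computation above (πQ = 0 with πe = 1) can be sketched numerically by replacing one balance equation with the normalization constraint. The 2×2 generator below is a made-up example for illustration, not a chain from the analysis.

```python
import numpy as np

# Solve pi Q = 0 subject to sum(pi) = 1 for a continuous-time Markov
# chain with generator Q: drop one (redundant) balance equation and
# substitute the normalization row.

def invariant_vector(Q):
    n = Q.shape[0]
    A = np.vstack([Q.T[:-1], np.ones(n)])  # balance eqs + normalization
    b = np.zeros(n)
    b[-1] = 1.0                            # right-hand side of pi . e = 1
    return np.linalg.solve(A, b)

Q = np.array([[-2.0,  2.0],
              [ 1.0, -1.0]])  # leave state 0 at rate 2, state 1 at rate 1
pi = invariant_vector(Q)      # pi = [1/3, 2/3]
```

In the full analysis this solve sits inside a fixed-point iteration, since the transition rates themselves depend on the state probabilities.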
[Figure: three ways to enforce the ratio, shown as time lines on P0/P1 — (1) GS, (2) DCS-TDM, and (3) DCS-PS, with x:y = 2:1]

[Figures: BE response time and RT deadline miss rate under fairness ratios RT:BE = 1:9, 2:1, and 9:1]

Outline
• From the OS's perspective
– Contribution 1: boosting the CPU utilization at supercomputing centers
– Contribution 2: providing quick responses for commercial workloads
– Contribution 3: scheduling multiple classes of applications
• From the application's perspective (NEXT)
– Characterizing decision-support workloads on the clustered database server
– Resource management for transaction processing workloads on the clustered database server

Experiment Setup
• IBM DB2 Universal Database for Linux, EEE, Version 7.2
• 8-node dual-processor Linux/Pentium cluster, with 256 MB RAM and an 18 GB disk on each node
• TPC-H workload; queries are run sequentially (Q1 – Q20) and the completion time of each query is measured

Platform
[Figure: a client issues "SELECT * FROM T" to the coordinator node, which fans the query out over Myrinet to the server nodes, each holding a partition of table T]

Methodology
• Identify the components with high system overhead
• For each such component, characterize the request distribution
• Come up with ways of optimizing it
• Quantify the potential benefits of the optimization

Sampling OS Statistics
• Sample the statistics provided by /proc (stat, net/dev, per-process stat):
– User/system CPU %
– # of page faults
– # of blocks read/written
– # of reads/writes
– # of packets sent/received
– CPU utilization during I/O

Kernel Instrumentation
• Instrument each system call in the kernel
[Figure: timeline from entering the system call, possibly blocking and unblocking, to exiting and resuming execution]

Operating System Profile
• A considerable part of the execution time is taken by the pread system call
• There is good overlap of computation with I/O for some queries
• More reads than writes
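The OS-statistics sampling described above might look like the following sketch: read /proc counters twice and report the deltas. The `/proc/stat` field layout is a Linux assumption, and the function names are hypothetical, not the study's actual tooling.

```python
import time

# Parse the aggregate "cpu" line of /proc/stat: the first four numeric
# fields are user, nice, system, and idle jiffies.

def read_cpu_jiffies(path="/proc/stat"):
    with open(path) as f:
        fields = f.readline().split()       # "cpu user nice system idle ..."
    user, nice, system, idle = (int(x) for x in fields[1:5])
    return user + nice, system, idle

def sample_cpu(interval_s=1.0):
    """Sample twice and turn jiffy deltas into user/system CPU %."""
    u0, s0, i0 = read_cpu_jiffies()
    time.sleep(interval_s)
    u1, s1, i1 = read_cpu_jiffies()
    total = (u1 - u0) + (s1 - s0) + (i1 - i0)
    return {"user%": 100 * (u1 - u0) / total,
            "system%": 100 * (s1 - s0) / total}
```

The same delta-sampling pattern applies to `/proc/net/dev` for packet counts and to the per-process stat files for fault and I/O counters.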
TPC-H pread Overhead

Query  % of exe time    Query  % of exe time
Q6     20.0             Q13    10.0
Q14    19.0             Q3      9.6
Q19    16.9             Q4      9.1
Q12    15.4             Q18     9.0
Q15    13.4             Q20     7.9
Q7     12.1             Q2      5.2
Q17    10.8             Q9      5.2
Q8     10.5             Q5      4.6
Q10    10.3             Q16     4.1
Q1     10.0             Q11     3.5

pread overhead = (# of preads) × (overhead per pread)

pread Optimization

pread(dest, chunk) {
    for each page in the chunk {
        if the page is not in the page cache {
            bring it in from disk
        }
        copy the page into dest   /* this copy is the overhead */
    }
}

[Figure: each pread copies data from the kernel page cache into the user-space buffer]

Optimization:
• Re-map the page-cache pages into the user buffer instead of copying
• Mark the re-mapped pages read-only and use copy-on-write

Copy-on-Write: % Reduction in Copies

Query  % reduction    Query  % reduction
Q1      98.9          Q11     96.1
Q2      85.7          Q12     87.1
Q3      96.0          Q13    100.0
Q4      80.9          Q14     96.1
Q5     100.0          Q15     96.8
Q6     100.0          Q16     70.7
Q7      79.7          Q17     94.5
Q8      79.3          Q18    100.0
Q9      88.7          Q19     95.7
Q10     77.8          Q20     94.4

% reduction = 1 − (# of copy-on-write faults / # of preads)

Operating System Profile
• Socket calls are the next most dominant system calls

Message Characteristics
[Figures for Q11 and Q16: distributions of message size (bytes), message inter-injection time (milliseconds), and message destination]

Observations on Messages
• Only a small set of message sizes is used
• Many messages are sent in a short period
• The message destination distribution is uniform
• Many messages are point-to-point implementations of multicast/broadcast messages
• Multicast can reduce the number of messages
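The buffer re-mapping idea behind the pread optimization above can be illustrated at user level with mmap: instead of pread copying file pages into a private buffer, the pages are mapped into the address space and reads go straight to the page cache. This is only a user-space illustration under the assumption that a shared read-only mapping is acceptable, not the kernel-level copy-on-write change itself.

```python
import mmap
import os

# Two ways to read `length` bytes at `offset` from an open file:
# pread copies the data out of the page cache into a fresh buffer,
# while mmap exposes the cached pages directly (no per-read copy).

def read_via_copy(fd, offset, length):
    return os.pread(fd, length, offset)           # kernel copies into buffer

def read_via_mmap(fd, offset, length):
    with mmap.mmap(fd, 0, prot=mmap.PROT_READ) as m:
        return bytes(m[offset:offset + length])   # backed by the page cache
```

Both return the same bytes; the difference is where the copy (if any) happens.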
Potential % Reduction in Messages

Query  Total  Small  Large    Query  Total  Small  Large
Q1     44.7   71.4   38.7     Q11     9.6   28.6    0.1
Q2     20.4   58.7    0.2     Q12     8.3    7.8    2.9
Q3     48.2   64.3   38.0     Q13    24.5   75.2    0.1
Q4     22.6   58.6    0.1     Q14    27.9   80.4    0.7
Q5      8.0    7.1    8.4     Q15    46.6   56.5    0.7
Q6     76.4   78.6   45.5     Q16    59.1   63.0   56.9
Q7     57.5   71.4   56.2     Q17    41.5   66.7   27.3
Q8     29.1   75.5    4.8     Q18    11.4   32.3    0.0
Q9     66.8   78.5   61.1     Q19    26.7   79.4    0.2
Q10    25.0   73.6    0.1     Q20    21.1   62.8    0.1

Online Algorithm

Original send:

Send(msg, dest) {
    send msg to node dest;
}

Coalescing send:

Send(msg, dest) {
    if (msg == buffered_msg && dest ∉ dest_set)
        dest_set = dest_set ∪ {dest};
    else
        buffer the msg;
}

Send_bg() {
    foreach buffered_msg
        if (it has been buffered longer than threshold)
            send a multicast msg to the nodes in dest_set;
}

Impact of Threshold
[Figures: effect of the buffering threshold (milliseconds) on Q7 and Q16]

Outline
• From the OS's perspective
– Contribution 1: boosting the CPU utilization at supercomputing centers
– Contribution 2: providing quick responses for commercial workloads
– Contribution 3: scheduling multiple classes of applications
• From the application's perspective (NEXT)
– Characterizing decision-support workloads on the clustered database server
– Resource management for clustered database applications

Ongoing/Near-term Work
• What is the optimal number of jobs to admit?
• Can we dynamically pause some processes based on resource requirements and resource availability?
• Which dynamic co-scheduling scheme works best here?
• How do we exploit application-level information in scheduling?

Future Work
• Some next-generation applications
– Real-time medical imaging and collaborative surgery
Application requirements:
• Vast processing power, disk capacity, and network bandwidth
• Absolute availability
• Deterministic performance

Future Work
– E-business on demand
Requirements:
• Performance (more users, responsiveness, quality of service)
• Availability
• Security
• Power consumption
• Pricing model

Future Work
• What does it take to get there?
– Hardware innovations
– Resource management and isolation
– Good scalability
– High availability
– Deterministic performance

Future Work
• Not only high performance:
– Energy consumption
– Security
– Pricing for service
– User satisfaction
– System management
– Ease of use

Related Work
• Parallel job scheduling:
– Gang scheduling ([Ousterhout82])
– Backfilling ([Lifka95], [Feitelson98])
– Migration ([Epima96])
• Dynamic co-scheduling:
– Spin Block ([Arpaci-Dusseau98], [Anglano00])
– Periodic Boost ([Nagar99])
– Demand-based Coscheduling ([Sobalvarro97])

Related Work (Cont'd)
• Real-time scheduling:
– Earliest Deadline First
– Rate Monotonic
– Least Laxity First
• Single-node multi-class scheduling:
– Hierarchical scheduling ([Goyal96])
– Proportional share ([Waldspurger95])
• Commercial clustered servers ([Pai98], reserve)

Related Work (Cont'd)
• Commercial workloads (CAECW, [Barford99], [Kant99])
• Database characterization ([Keeton99], [Ailamaki99], [Rosenblum97])
• OS support for databases ([Stonebraker81], [Gray78], [Christmann87])
• Reducing copies in I/O ([Pai00], [Druschel93], [Thadani95])

Publications
• IEEE Transactions on Parallel and Distributed Systems
• International Parallel and Distributed Processing Symposium (IPDPS 2000)
• ACM International Conference on Supercomputing (ICS 2000)
• International Euro-Par Conference (Euro-Par 2000)
• ACM Symposium on Parallel Algorithms and Architectures (SPAA 2001)
• Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP 2001)
• Workshop on Computer Architecture Evaluation Using Commercial Workloads (CAECW 2002)

Publications I: Batch Applications
• Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. An Integrated Approach to Parallel Scheduling Using Gang-Scheduling, Backfilling and Migration. 7th Workshop on Job Scheduling Strategies for Parallel Processing.
• Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. The Impact of Migration on Parallel Job Scheduling for Distributed Systems.
Proceedings of the 6th International Euro-Par Conference, Lecture Notes in Computer Science 1900, pages 242-251, Munich, Aug/Sep 2000.
• Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. Improving Parallel Job Scheduling by Combining Gang Scheduling and Backfilling Techniques. International Parallel and Distributed Processing Symposium (IPDPS 2000), pages 133-142, May 2000.
• Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. A Comparative Analysis of Space- and Time-Sharing Techniques for Parallel Job Scheduling in Large Scale Parallel Systems. Submitted to IEEE Transactions on Parallel and Distributed Systems.

Publications II: Interactive Applications
• M. Squillante, Y. Zhang, A. Sivasubramaniam, N. Gautam, H. Franke, J. Moreira. Analytic Modeling and Analysis of Dynamic Coscheduling for a Wide Spectrum of Parallel and Distributed Environments. Penn State CSE Tech Report CSE-01-004.
• Y. Zhang, A. Sivasubramaniam, J. Moreira, H. Franke. Impact of Workload and System Parameters on Next Generation Cluster Scheduling Mechanisms. To appear in IEEE Transactions on Parallel and Distributed Systems.
• Y. Zhang, A. Sivasubramaniam, H. Franke, J. Moreira. A Simulation-based Performance Study of Cluster Scheduling Mechanisms. 14th ACM International Conference on Supercomputing (ICS 2000), pages 100-109, May 2000.
• M. Squillante, Y. Zhang, A. Sivasubramaniam, N. Gautam, H. Franke, J. Moreira. Analytic Modeling and Analysis of Dynamic Coscheduling for a Wide Spectrum of Parallel and Distributed Environments. Submitted to ACM Transactions on Modeling and Computer Simulation (TOMACS).

Publications III: Multi-class Applications
• Y. Zhang, A. Sivasubramaniam. Scheduling Best-Effort and Real-Time Pipelined Applications on Time-Shared Clusters. 13th Annual ACM Symposium on Parallel Algorithms and Architectures.
• Y. Zhang, A. Sivasubramaniam. Scheduling Best-Effort and Real-Time Pipelined Applications on Time-Shared Clusters. Submitted to IEEE Transactions on Parallel and Distributed Systems.

Publications IV: Database
• Y. Zhang, J. Zhang, A. Sivasubramaniam, C. Liu, H. Franke. Decision-Support Workload Characteristics on a Clustered Database Server from the OS Perspective. Penn State Tech Report CSE-01-003.

Thank You!

I/O Characteristics (Q6)
[Backup figure]