Scheduling and Resource Management for Next-generation Clusters
Yanyong Zhang
Penn State University
www.cse.psu.edu/~yyzhang
What is a Cluster?
• Cost effective
• Easily scalable
• Highly available
• Readily upgradeable
Scientific & Engineering
Applications
• HPTi won a 5-year $15M procurement to provide systems for
weather modeling (NOAA).
(http://www.noaanews.noaa.gov/stories/s419.htm)
• Sandia's expansion of their Alpha-based C-plant system.
• Maui HPCC LosLobos Linux Super-cluster
(http://www.dl.ac.uk/CFS/benchmarks/beowulf/tsld007.htm)
• A performance-price ratio of … is demonstrated in
simulations of wind instruments using a cluster of 20 ….
(http://www.swiss.ai.mit.edu/~pas/p/sc95.html)
• The PC cluster based parallel simulation environment and the
technologies … will have a positive impact on networking
research nationwide ….
(http://www.osc.edu/press/releases/2001/approved.shtml)
Commercial Applications
• Business applications
– Transaction Processing (IBM DB2, Oracle …)
– Decision Support Systems (IBM DB2, Oracle …)
• Internet applications
– Web serving / searching (Google.com …)
– Infowares (Yahoo.com, AOL.com)
– Email, eChat, ePhone, eBook, eBank, eSociety, eAnything
– Computing portals
Resource Management
• Each application is demanding.
• Several applications/users can be present at the same time.
⇒ Resource management and quality-of-service become important.
System Model
[Figure: nodes P0–P4, each with its own run queue, connected by a high-speed network; arriving jobs wait in an arrival queue]
• Each node is independent
• Maximum MPL (multiprogramming level) per node
• Arrival queue
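This model can be sketched as a small simulation (all names and numbers here are illustrative, not from the talk): each node independently enforces a maximum MPL, and a job that no node can admit waits in the arrival queue.

```python
from collections import deque

class Node:
    """One independent cluster node with a bounded multiprogramming level."""
    def __init__(self, max_mpl):
        self.max_mpl = max_mpl
        self.running = []

    def try_admit(self, job):
        if len(self.running) < self.max_mpl:   # admission control: respect MPL
            self.running.append(job)
            return True
        return False

class Cluster:
    def __init__(self, num_nodes, max_mpl):
        self.nodes = [Node(max_mpl) for _ in range(num_nodes)]
        self.arrival_q = deque()               # jobs waiting for admission

    def submit(self, job):
        # one simple arrival-queue principle: least-loaded node first
        node = min(self.nodes, key=lambda n: len(n.running))
        if not node.try_admit(job):
            self.arrival_q.append(job)

cluster = Cluster(num_nodes=2, max_mpl=2)
for j in range(5):
    cluster.submit(f"job{j}")
# 4 jobs admitted (2 nodes x MPL 2); the fifth waits in the arrival queue
```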
Two Phases in Resource Management
• Allocation Issues
– Admission Control
– Arrival Queue Principle
• Scheduling Issues (CPU Scheduling)
– Resource Isolation
– Co-allocation
Co-allocation / Coscheduling
[Figure: P0 issues SEND at t0 while P1 only switches to the matching RECV at t1; the gap between t0 and t1 is the scheduling skewness]
Outline
• From OS’s perspective
NEXT
– Contribution 1: boosting the CPU utilization
at supercomputing centers
– Contribution 2: providing quick responses for
commercial workloads
– Contribution 3: scheduling multiple classes of
applications
• From application’s perspective
– Contribution 4: optimizing clustered DB2
Contribution 1:
Boosting CPU Utilization at
Supercomputing Centers
Objective
Response Time = Wait Time + Execute Time
• Wait Time: waiting in the arrival queue
• Execute Time: includes waiting in the ready/blocked queue
slowdown = Response Time / Execute Time in Isolation
Goal: minimize slowdown.
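In code, the slowdown metric reads as follows (the numbers are illustrative):

```python
def slowdown(wait_time, execute_time, execute_time_in_isolation):
    """slowdown = response time / execution time in isolation,
    where response time = wait time + execute time."""
    return (wait_time + execute_time) / execute_time_in_isolation

# Illustrative numbers: a job waits 30 min, runs 90 min on the loaded
# system, but would need only 60 min on an idle system.
s = slowdown(30.0, 90.0, 60.0)   # -> 2.0
```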
Existing Techniques
[Figure: space (# of CPUs = 14) vs. time charts showing jobs of different widths and lengths under each technique]
• Back Filling (BF): later, smaller jobs move ahead to fill idle CPUs without delaying earlier jobs
• Gang Scheduling (GS): all tasks of a job run in the same time slice across nodes
• Migration (M): running jobs move between nodes to compact the schedule
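Of these, backfilling is easy to sketch concretely: a later job may jump ahead only if it fits in the currently idle CPUs and finishes before the queue head's reserved start time (an EASY-style rule; job names and numbers here are made up).

```python
def backfill(free_cpus, now, first_job_start, queue):
    """Start queued jobs out of order when they fit into the idle CPUs
    and finish before the head job's reserved start time (EASY-style).
    Each job is (name, cpus, runtime)."""
    started = []
    for name, cpus, runtime in queue[1:]:        # queue head keeps its reservation
        if cpus <= free_cpus and now + runtime <= first_job_start:
            started.append(name)                 # safe to backfill this job
            free_cpus -= cpus
    return started

queue = [("head", 12, 8),    # too wide to start now; reserved to start at t=4
         ("small", 2, 3),    # fits the hole and ends before t=4
         ("wide", 6, 2),     # too wide for the remaining idle CPUs
         ("late", 2, 9)]     # would run past the head's reservation
started = backfill(free_cpus=6, now=0, first_job_start=4, queue=queue)
```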
Proposed Scheme
• MBGS = GS + BF + M
– Use GS as the basic framework
– At each row of the GS matrix, apply the BF technique
– Whenever the GS matrix is re-calculated, consider M.
How Does MBGS Perform?
Outline
• From OS’s perspective
– Contribution 1: boosting the CPU utilization at supercomputing centers
NEXT
– Contribution 2: providing quick responses for commercial workloads
– Contribution 3: scheduling multiple classes of applications
• From application’s perspective
– Contribution 4: optimizing clustered DB2
Contribution 2:
Reducing Response Times for
Commercial Applications
Objective
Response Time = Wait Time + Execute Time
• Wait Time: waiting in the arrival queue
• Execute Time: includes waiting in the ready/blocked queue
• Minimize wait time
• Minimize response time
Previous Work I: Gang Scheduling (GS)
[Figure: (1) time quanta are minutes long; (2) CPU slots go wasted]
GS is not responsive enough!
Previous Work II: Dynamic Co-scheduling
[Figure: nodes P0–P3 run jobs B, D, A, C; B just gets a message, C just finishes I/O, everybody else is blocked, so it’s A’s turn]
The scheduler on each node makes independent decisions based on local events, without global synchronization.
Dynamic Co-scheduling Heuristics

How do you wait for a message? Busy Wait, Spin Block (SB), or Spin Yield (SY).
What do you do on message arrival? No explicit reschedule, interrupt & reschedule, or periodically reschedule.

On message arrival \ Waiting   Busy Wait   Spin Block   Spin Yield
No explicit reschedule         Local       SB           SY
Interrupt & reschedule         DCS         DCS-SB       DCS-SY
Periodically reschedule        PB          PB-SB        PB-SY
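The Spin Block column can be sketched with a user-level analogue (a sketch, not the kernel mechanism; `threading.Event` stands in for message arrival, and the times are made up):

```python
import threading
import time

def spin_block_wait(msg_arrived: threading.Event, spin_time: float) -> str:
    """Spin Block (SB): busy-wait for up to `spin_time` seconds; if the
    message still hasn't arrived, block so another process can run."""
    deadline = time.monotonic() + spin_time
    while time.monotonic() < deadline:        # spin phase: stay on the CPU
        if msg_arrived.is_set():
            return "consumed while spinning"
    msg_arrived.wait()                        # block phase: yield the CPU
    return "consumed after blocking"

msg = threading.Event()
threading.Timer(0.05, msg.set).start()        # message arrives after 50 ms
result = spin_block_wait(msg, spin_time=0.01) # spin budget is only 10 ms
# -> "consumed after blocking"
```

The spin time trades CPU waste against context-switch cost, which is exactly the parameter the analytical model later optimizes.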
Simulation Study
• A detailed simulator operating at microsecond granularity
• System parameters
– System configurations (maximum MPL, to partition or not)
– System overheads (context-switch overheads, interrupt costs, costs associated with manipulating queues)
Simulation Study (Cont’d)
• Application parameters
– Injection load
– Characteristics (CPU intensive, IO
intensive, communication intensive or
somewhere in the middle)
Impact of Load
Impact of Workload Characteristics
[Figures: average job response time (×10,000 seconds) under varying load, and for communication-intensive vs. I/O-intensive workloads]
Periodic Boost Heuristics
• S1: Compute Phase
• S2: S1 + Unconsumed Msg.
• S3: Recv. + Msg. Arrived
• S4: Recv. + No Msg.
Orderings compared:
• A: S3 → {S2, S1}
• B: S3 → S2 → S1
• C: {S3, S2, S1}
• D: {S3, S2} → S1
• E: S2 → S3 → S1
[Figure: response times of orderings A–E, y-axis from 2.3 to 2.9]
Analytical Modeling Study
• The state space is impossible to handle.
[Figure: nodes P0 … Pp connected by a high-speed network, with dynamic job arrivals]
Analysis Description

Original state space (impossible to handle!!):

X = { ⟨ i, j̄_1, …, j̄_P, j^A ⟩ :
      j̄_k = ⟨ i_k, j_k^R, j_{k,1}^B, …, j_{k,i_k}^B ⟩,
      i_k ∈ {1, …, M},  j_k^R(l) ∈ {1, …, i_M},
      j_{k,l}^B ∈ {1, …, N},  j_k^Q ∈ {1, …, m_Q + m_O},
      j^A ∈ {1, …, m_A},  k = 1, …, P }

where P is the number of nodes, i_k the number of jobs on node k, and N = Σ_l n_l.

Assumption: the state of each processor is stochastically independent and identical to the state of the other processors.

Reduced state space (much more tractable!!):

Y = { ⟨ i, j^A, j^Q, j̄ ⟩ :
      j̄ = ⟨ i, j^R, j_1^B, …, j_i^B ⟩,
      i ∈ {1, …, M},  j^A ∈ {1, …, m_A},  j^R(l) ∈ {1, …, i_M},
      j_k^B ∈ {1, …, N},  j^Q ∈ {1, …, m_Q + m_O} }
Analysis Description (Cont’d)
• Derive the state transition rates using a continuous-time Markov model; build the generator matrix Q.
• Obtain the invariant probability vector π by solving πQ = 0 and πe = 1.
• Use fixed-point iteration to get the solution.
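The πQ = 0 step can be illustrated on a toy 3-state generator (pure Python; the matrix below is invented, not the thesis model). Uniformization turns Q into a stochastic matrix P = I + Q/Λ with the same invariant vector, and repeating π ← πP is itself a fixed-point iteration:

```python
def invariant_distribution(Q, iters=20000):
    """Solve pi Q = 0 with sum(pi) = 1 via uniformization + power iteration:
    P = I + Q/Lambda is stochastic and shares Q's invariant vector."""
    n = len(Q)
    lam = max(-Q[i][i] for i in range(n)) * 1.1          # Lambda >= max |q_ii|
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / lam for j in range(n)]
         for i in range(n)]
    pi = [1.0 / n] * n
    for _ in range(iters):                               # fixed point: pi <- pi P
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# Toy generator: the three states could stand for compute/spin/block phases.
Q = [[-2.0,  1.5,  0.5],
     [ 1.0, -1.0,  0.0],
     [ 0.5,  0.5, -1.0]]
pi = invariant_distribution(Q)
# pi Q should be (numerically) the zero vector:
residual = max(abs(sum(pi[i] * Q[i][j] for i in range(3))) for j in range(3))
```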
SB Example
[Figure: a slice of the Markov chain for the Spin Block (SB) scheme; each state records the phase of jobs 1 and 2 (Compute, I/O, Spin, Block, or Queued), and transition rates such as r1 and r2 are weighted sums of state probabilities and phase rates]
Results
[Figures: optimal PB frequency; optimal spin time for SB]
Results – Optimal Quantum Length
[Figure: optimal quantum length for communication-intensive, CPU-intensive, and I/O-intensive workloads]
Outline
• From OS’s perspective
– Contribution 1: boosting the CPU utilization at supercomputing centers
– Contribution 2: providing quick responses for commercial workloads
NEXT
– Contribution 3: scheduling multiple classes of applications
• From application’s perspective
– Contribution 4: optimizing clustered DB2
Contribution 3:
Scheduling Multiple Classes of Applications
[Figure: interactive, batch, and real-time applications sharing the cluster]
Objective
[Figure: best-effort (BE) and real-time (RT) jobs submitted to the cluster]
• BE: “How long did it take me to finish?” → response time
• RT: “How many deadlines have been missed?” → miss rate
Fairness Ratio (x:y)
Of the cluster resources, RT receives x/(x+y) and BE receives y/(x+y).
How to Adhere to Fairness Ratio? (x:y = 2:1)
[Figures: time lines on nodes P0 and P1 for three schemes]
(1) GS: gang-scheduling matrix with rows RT1, RT2, BE
(2) DCS-TDM: dynamic co-scheduling with time-division multiplexing between RT and BE
(3) DCS-PS: dynamic co-scheduling with proportional sharing
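The time-division variant is simple to sketch: out of every x+y time slots, RT gets x and BE gets y (a sketch of the slide's 2:1 example; the slot granularity is abstract).

```python
def tdm_schedule(x, y, num_slots):
    """Time-division multiplexing honoring an x:y RT:BE fairness ratio:
    in every period of x + y time slots, RT gets x slots and BE gets y."""
    period = x + y
    return ["RT" if slot % period < x else "BE" for slot in range(num_slots)]

slots = tdm_schedule(2, 1, 6)
# -> ['RT', 'RT', 'BE', 'RT', 'RT', 'BE']  (RT:BE = 2:1 over each period)
```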
BE Response Time
[Figure: BE response time for RT:BE = 1:9, 2:1, and 9:1]
RT Deadline Miss Rate
[Figure: RT deadline miss rate for RT:BE = 1:9, 2:1, and 9:1]
Outline
• From OS’s perspective
– Contribution 1: boosting the CPU utilization at
supercomputing centers
– Contribution 2: providing quick responses for
commercial workloads
– Contribution 3: scheduling multiple classes of
applications
• From application’s perspective
NEXT
– Characterizing decision support workloads on the
clustered database server
– Resource management for transaction processing
workloads on the clustered database server
Experiment Setup
• IBM DB2 Universal Database for Linux, EEE, Version 7.2
• An 8-node cluster of dual-CPU Linux/Pentium machines, each with 256 MB RAM and an 18 GB disk.
• TPC-H workload. Queries are run sequentially (Q1 – Q20); the completion time of each query is measured.
Platform
[Figure: a client sends “select * from T” to the coordinator node of a 3-node server over Myrinet; table T (rows A 001 … D 004) is partitioned across the server nodes, and the coordinator merges the partial results]
Methodology
• Identify the components with high system overhead.
• For each such component, characterize the request distribution.
• Propose optimizations.
• Quantify the potential benefits of each optimization.
Sampling OS Statistics
• Sample the statistics provided under /proc: stat, net/dev, <pid>/stat.
– User/system CPU %
– # of page faults
– # of blocks read/written
– # of reads/writes
– # of packets sent/received
– CPU utilization during I/O
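For instance, the CPU percentages can be derived from the counters on the `cpu` line of /proc/stat (a sketch; a real sampler would diff two readings taken an interval apart, and the jiffy counts below are invented):

```python
def cpu_percentages(stat_line):
    """Turn a /proc/stat 'cpu' line into user/system/idle percentages.
    Field order follows proc(5): user nice system idle iowait ..."""
    fields = [int(v) for v in stat_line.split()[1:]]
    user, nice, system, idle = fields[:4]
    total = sum(fields)
    return {"user%": 100 * (user + nice) / total,
            "system%": 100 * system / total,
            "idle%": 100 * idle / total}

stats = cpu_percentages("cpu 300 0 100 600")   # cumulative jiffies since boot
# -> {'user%': 30.0, 'system%': 10.0, 'idle%': 60.0}
```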
Kernel Instrumentation
• Instrument each system call in the kernel.
[Figure: timestamps recorded at system-call entry and exit, and at block, unblock, and resume-execution events]
Operating System Profile
• A considerable part of the execution time is spent in the pread system call.
• There is good overlap of computation with I/O for some queries.
• More reads than writes.
TPC-H pread Overhead

Query  % of exe time    Query  % of exe time
Q6     20.0             Q13    10.0
Q14    19.0             Q3      9.6
Q19    16.9             Q4      9.1
Q12    15.4             Q18     9.0
Q15    13.4             Q20     7.9
Q7     12.1             Q2      5.2
Q17    10.8             Q9      5.2
Q8     10.5             Q5      4.6
Q10    10.3             Q16     4.1
Q1     10.0             Q11     3.5
pread overhead = # of preads × overhead per pread.
pread Optimization
[Figure: pread walks the page table and copies each page from the kernel page cache into the user-space buffer]

pread(dest, chunk) {
    for each page in the chunk {
        if the page is not in cache {
            bring it in from disk
        }
        copy the page into dest
    }
}

Optimizations:
• Re-mapping the buffer
• Copy-on-write
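The buffer-remapping idea can be demonstrated with mmap on a POSIX system (a sketch; the file and its contents are made up): the same bytes a pread would copy out can be read in place from the mapping.

```python
import mmap
import os
import tempfile

# Stand-in for a database data file (contents are invented).
fd, path = tempfile.mkstemp()
os.write(fd, b"A001B002C003D004" * 1024)
os.close(fd)

# Conventional path: pread copies page-cache data into a private buffer.
fd = os.open(path, os.O_RDONLY)
copied = os.pread(fd, 4096, 0)

# Remapping path: mmap exposes the page-cache pages directly in user
# space, so a memoryview slice reads them without a per-call copy.
mapped = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
view = memoryview(mapped)[:4096]
same = bytes(view) == copied        # identical bytes, without the copy

view.release()
mapped.close()
os.close(fd)
os.remove(path)
```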
Copy-on-Write
[Figure: page-cache pages are mapped read-only into user space; a private copy is made only when a page is written]

Query  % reduction    Query  % reduction
Q1      98.9          Q11     96.1
Q2      85.7          Q12     87.1
Q3      96.0          Q13    100.0
Q4      80.9          Q14     96.1
Q5     100.0          Q15     96.8
Q6     100.0          Q16     70.7
Q7      79.7          Q17     94.5
Q8      79.3          Q18    100.0
Q9      88.7          Q19     95.7
Q10     77.8          Q20     94.4

% reduction = 1 − (# of copy-on-writes / # of preads)
Operating System Profile
• Socket calls are the next most dominant system calls.

Message Characteristics
[Figures for Q11 and Q16: distributions of message size (bytes), message inter-injection time (milliseconds), and message destination]
Observations on Messages
• Only a small set of message sizes is used.
• Many messages are sent in a short period.
• Message destination distribution is uniform.
• Many messages are point-to-point implementations of
multicast/broadcast messages.
• Multicast can reduce # of messages.
Potential % Reduction in Messages

query  total  small  large    query  total  small  large
Q1     44.7   71.4   38.7     Q11     9.6   28.6    0.1
Q2     20.4   58.7    0.2     Q12     8.3    7.8    2.9
Q3     48.2   64.3   38.0     Q13    24.5   75.2    0.1
Q4     22.6   58.6    0.1     Q14    27.9   80.4    0.7
Q5      8.0    7.1    8.4     Q15    46.6   56.5    0.7
Q6     76.4   78.6   45.5     Q16    59.1   63.0   56.9
Q7     57.5   71.4   56.2     Q17    41.5   66.7   27.3
Q8     29.1   75.5    4.8     Q18    11.4   32.3    0.0
Q9     66.8   78.5   61.1     Q19    26.7   79.4    0.2
Q10    25.0   73.6    0.1     Q20    21.1   62.8    0.1
Online Algorithm

Before (one unicast per call):

Send ( msg, dest ) {
    send msg to node dest;
}

After (aggregate identical messages into a multicast):

Send ( msg, dest ) {
    if (msg == buffered_msg && dest ∉ dest_set)
        dest_set = dest_set ∪ { dest };
    else
        buffer the msg;
}

Send_bg () {
    foreach buffered_msg
        if ( it has been buffered longer than threshold )
            send multicast msg to nodes in dest_set;
}
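The algorithm above can be sketched as runnable code (a sketch; the class and its interface are made up, and the transport is just a callback): identical messages submitted within the threshold are merged and flushed as one multicast.

```python
import time

class MulticastAggregator:
    """Buffer identical messages and flush them as a single multicast
    once they have been buffered longer than `threshold` seconds."""

    def __init__(self, threshold, transport):
        self.threshold = threshold      # how long a message may stay buffered
        self.transport = transport      # callable(msg, dest_set): the real send
        self.buffered = {}              # msg -> [dest_set, time first buffered]

    def send(self, msg, dest):
        if msg in self.buffered:
            self.buffered[msg][0].add(dest)   # fold into the pending multicast
        else:
            self.buffered[msg] = [{dest}, time.time()]

    def send_bg(self):
        """Background pass: flush anything buffered longer than threshold."""
        now = time.time()
        for msg in list(self.buffered):
            dests, t0 = self.buffered[msg]
            if now - t0 >= self.threshold:
                self.transport(msg, dests)    # one multicast replaces len(dests) unicasts
                del self.buffered[msg]

sent = []
agg = MulticastAggregator(threshold=0.0,
                          transport=lambda m, d: sent.append((m, sorted(d))))
for node in (1, 2, 3):
    agg.send(b"row-batch", node)    # three identical unicasts are merged
agg.send_bg()                       # flushed as a single multicast to {1, 2, 3}
```

The threshold is the knob the next slide varies: a larger value merges more messages but delays them longer.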
Impact of Threshold
[Figures: impact of the threshold (milliseconds) for Q7 and Q16]
Outline
• From OS’s perspective
– Contribution 1: boosting the CPU utilization at supercomputing centers
– Contribution 2: providing quick responses for commercial workloads
– Contribution 3: scheduling multiple classes of applications
• From application’s perspective
– Characterizing decision support workloads on the clustered database server
NEXT
– Resource management for clustered database applications
Ongoing/Near-term Work
• What is the optimal number of jobs that should be admitted?
• Can we dynamically pause some processes based on resource requirements and resource availability?
• Which dynamic co-scheduling scheme works best here?
• How do we exploit application-level information in scheduling?
Future Work
• Some next-generation applications
– Real-time medical imaging and collaborative surgery
Application requirements:
• VAST processing power, disk capacity, and network bandwidth
• absolute availability
• deterministic performance
Future Work
– E-business on demand
Requirements:
• performance
  – more users
  – responsiveness
  – quality-of-service
• availability
• security
• power consumption
• pricing model
Future Work
• What does it take to get there?
– Hardware innovations
– Resource management and isolation
– Good scalability
– High availability
– Deterministic performance
Future Work
• Not only high performance
– Energy consumption
– Security
– Pricing for service
– User satisfaction
– System management
– Ease of use
Related Work
• Parallel job scheduling:
– Gang Scheduling ([Ousterhout82])
– Backfilling ([Lifka95], [Feitelson98])
– Migration ([Epema96])
• Dynamic co-scheduling:
– Spin Block ([Arpaci-Dusseau98], [Anglano00])
– Periodic Boost ([Nagar99])
– Demand-based Coscheduling ([Sobalvarro97])
Related Work (Cont’d)
• Real-time Scheduling:
– Earliest Deadline First
– Rate Monotonic
– Least Laxity First
• Single-node multi-class scheduling
– Hierarchical scheduling ([Goyal96])
– Proportional share ([Waldspurger95])
• Commercial clustered servers ([Pai98], cluster reserves)
Related Work (Cont’d)
• Commercial workloads (CAECW, [Barford99], [Kant99])
• Database characterization ([Keeton99], [Ailamaki99], [Rosenblum97])
• OS support for databases ([Stonebraker81], [Gray78], [Christmann87])
• Reducing copies in I/O ([Pai00], [Druschel93], [Thadani95])
Publications
• IEEE Transactions on Parallel and Distributed Systems.
• International Parallel and Distributed Processing
Symposium (IPDPS 2000)
• ACM International Conference on Supercomputing (ICS
2000)
• International Euro-par Conference (Europar 2000)
• ACM Symposium on Parallel Algorithms and Architectures
(SPAA 2001)
• Workshop on Job Scheduling Strategies for Parallel
Processing (JSSPP 2001)
• Workshop on Computer Architecture Evaluation Using
Commercial Workloads (CAECW 2002)
Publications I: Batch Applications
• Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. An Integrated Approach to Parallel Scheduling Using Gang-Scheduling, Backfilling and Migration. 7th Workshop on Job Scheduling Strategies for Parallel Processing.
• Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. The Impact of Migration on Parallel Job Scheduling for Distributed Systems. Proceedings of the 6th International Euro-Par Conference, Lecture Notes in Computer Science 1900, pages 242-251, Munich, Aug/Sep 2000.
• Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. Improving Parallel Job Scheduling by Combining Gang Scheduling and Backfilling Techniques. International Parallel and Distributed Processing Symposium (IPDPS'2000), pages 133-142, May 2000.
• Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. A Comparative Analysis of Space- and Time-Sharing Techniques for Parallel Job Scheduling in Large Scale Parallel Systems. Submitted to IEEE Transactions on Parallel and Distributed Systems.
Publications II: Interactive Applications
• M. Squillante, Y. Zhang, A. Sivasubramaniam, N. Gautam, H. Franke, J. Moreira. Analytic Modeling and Analysis of Dynamic Coscheduling for a Wide Spectrum of Parallel and Distributed Environments. Penn State CSE tech report CSE-01-004.
• Y. Zhang, A. Sivasubramaniam, J. Moreira, H. Franke. Impact of Workload and System Parameters on Next Generation Cluster Scheduling Mechanisms. To appear in IEEE Transactions on Parallel and Distributed Systems.
• Y. Zhang, A. Sivasubramaniam, H. Franke, J. Moreira. A Simulation-based Performance Study of Cluster Scheduling Mechanisms. 14th ACM International Conference on Supercomputing (ICS'2000), pages 100-109, May 2000.
• M. Squillante, Y. Zhang, A. Sivasubramaniam, N. Gautam, H. Franke, J. Moreira. Analytic Modeling and Analysis of Dynamic Coscheduling for a Wide Spectrum of Parallel and Distributed Environments. Submitted to ACM Transactions on Modeling and Computer Simulation (TOMACS).
Publications III: Multi-class Applications
• Y. Zhang, A. Sivasubramaniam. Scheduling Best-Effort and Real-Time Pipelined Applications on Time-Shared Clusters. 13th Annual ACM Symposium on Parallel Algorithms and Architectures.
• Y. Zhang, A. Sivasubramaniam. Scheduling Best-Effort and Real-Time Pipelined Applications on Time-Shared Clusters. Submitted to IEEE Transactions on Parallel and Distributed Systems.
Publications IV: Database
• Y. Zhang, J. Zhang, A. Sivasubramaniam, C. Liu, H. Franke. Decision-Support Workload Characteristics on a Clustered Database Server from the OS Perspective. Penn State Technical Report CSE-01-003.
Thank You !
I/O Characteristics (Q6)