Performance Estimation for Scheduling on
Shared Networks
Shreeni Venkataramaiah
Jaspal Subhlok
University of Houston
JSSPP 2003
JSSPP 2003, slide 1
Distributed Applications on Networks:
Resource Selection, Mapping, Adapting
[Diagram: an application task graph (Data, Model, Pre, Sim 1, Sim 2, Stream, Vis) mapped onto a network: where is the best performance?]
JSSPP 2003, slide 2
Resource Selection Framework
[Flowchart: a Network Model (measured & forecast network conditions, i.e. current resource availability) and an Application Model feed into predicting application performance under current network conditions (the subject of this paper), which drives resource selection and scheduling]
JSSPP 2003, slide 3
Building the “Sharing Performance Model”
Sharing Performance Model: predicts application performance under given availability of CPU and network resources.
1. Execute the application on a controlled testbed
– monitor CPU and network during execution
[Diagram: four CPUs connected through a router]
2. Analyze to generate the sharing performance model
• application resource needs determine performance
• the application is treated as a black box
JSSPP 2003, slide 4
Resource Shared Execution Principles
• Network Sharing
– sharing changes observed bandwidth and latency
– effective application level latency and bandwidth
determine time to transfer a message
• CPU Sharing
– scheduler attempts to give equal CPU time to all
processes
– a competing process is first awarded “idle time”, then
competes to get overall equal share of CPU
JSSPP 2003, slide 5
CPU sharing (1 competing process)
[Timeline figure, legend: application using CPU, CPU idle, competing process using CPU. Dedicated execution vs. CPU-shared execution, showing CPU time slices and the corresponding progress over time]
• This application keeps the CPU 100% busy…
• …so its execution time doubles with CPU sharing
JSSPP 2003, slide 6
CPU sharing (1 competing process), continued
[Timeline figures: dedicated execution vs. CPU-shared execution for two cases]
• If the CPU is mostly idle (less than 50% busy) during dedicated execution, execution time is unchanged with CPU sharing
• If the CPU is busy 50–100% of the time during dedicated execution, execution time increases by between 0 and 100%
• slowdown is predictable if the usage pattern is known
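This single-node rule can be sketched as a small function; the function and parameter names are illustrative, not from the paper:

```python
def predicted_time_one_competitor(dedicated_time, busy_fraction):
    """Predicted execution time with one compute-bound competing
    process on the same CPU (a sketch of the slide's rule).

    dedicated_time: wall-clock time of a dedicated run.
    busy_fraction:  fraction of that run during which the CPU was busy.
    """
    # The scheduler gives each process an equal share, so the
    # application can use at most half the CPU.  If it was less than
    # 50% busy anyway, its idle time absorbs the competitor and the
    # execution time is unchanged; at 100% busy it doubles.
    return dedicated_time * max(1.0, 2.0 * busy_fraction)
```

For example, a run that was 75% busy is predicted to slow down by 50%.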
JSSPP 2003, slide 7
Shared CPU on All Nodes
Note that each node is scheduled independently.
When one process attempts to send a message, the other might be swapped out, leading to a synchronization wait…
– difficult to model because of timing
– we develop upper and lower bounds on execution time
JSSPP 2003, slide 8
All Shared CPUs:
Lower bound on execution time
Ignore additional synchronization waits due to independent scheduling… execution time is the maximum of the execution times of the nodes, computed individually.
This is not necessarily outrageously optimistic! Why ?
Because application processes often get into a lock-step
execution mode on common OSs. Why ?
1. Process A tries to send a message to B…
2. B is not executing, so A gets swapped out but has priority
3. When B is swapped in, A gets back into ready queue
immediately and starts executing
4. Eventually, they start getting scheduled together
JSSPP 2003, slide 9
All Shared CPUs:
Upper bound on execution time
The CPU is in one of these modes during application execution. Consider the impact of 1 competing compute-intensive load…
• Computation: at most doubles
• Communication:
– can double because of the node's own CPU scheduling
– can double because of the other node's CPU scheduling
– quadruples in the worst case
• Idle:
– idle time is spent waiting for another computation and/or communication to finish
– quadruples in the worst case
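The two bounds can be sketched together as follows; the per-node breakdown into compute, communication, and idle time is the hypothetical output of the testbed measurements, and the per-mode worst-case factors are taken from the bullets above:

```python
def bounds_all_shared(node_profiles):
    """Bounds on execution time when every node's CPU has one
    competing compute-intensive load (a sketch, not the paper's
    exact model).

    node_profiles: one (compute, comm, idle) tuple per node, the
    breakdown of a dedicated run measured on the testbed.
    """
    lower, upper = [], []
    for compute, comm, idle in node_profiles:
        busy = compute + comm
        total = busy + idle
        # Lower bound: ignore synchronization waits caused by
        # independent scheduling, treat each node like the
        # single-node case (busy time at most doubles), and let
        # the run finish when the slowest node does.
        lower.append(max(total, 2.0 * busy))
        # Upper bound: computation at most doubles, while
        # communication and idle time can quadruple in the worst case.
        upper.append(2.0 * compute + 4.0 * comm + 4.0 * idle)
    return max(lower), max(upper)
```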
JSSPP 2003, slide 10
Shared Communication Links
• It is assumed that we know at runtime:
– expected latency and bandwidth on shared links
• We will see how to compute:
– the sequence of messages exchanged by processes
ONE LINK SHARED
– the time to transfer a message of a given size can be computed → new total communication time
– computation and idle time are unchanged
ALL LINKS SHARED
– idle time may also increase by the same factor as communication time, because of the slowdown of communication on other nodes
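With the expected latency and bandwidth known, the new total communication time follows from a simple linear cost model (a sketch under that assumption; the paper's actual transfer model may differ):

```python
def shared_comm_time(message_sizes, latency, bandwidth):
    """Total time to transfer a sequence of messages over one shared
    link, using a linear latency + size/bandwidth cost model.

    message_sizes: sizes in bytes; latency in seconds; bandwidth in
    bytes/second, as observed under the current sharing conditions.
    """
    # Each message pays the effective latency once, plus its
    # serialization time at the effective bandwidth.
    return sum(latency + size / bandwidth for size in message_sizes)
```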
JSSPP 2003, slide 11
Methodology for Building Application’s
Sharing Performance Model
1. Execute application on a controlled testbed and
measure system level activity
– CPU and network usage
2. Analyze and construct program level activity
– such as message exchanges, synchronization waits
3. Develop sharing performance model
JSSPP 2003, slide 12
Measurement and Modeling of Communication
Goal is to capture the size and sequence of application
messages, such as MPI messages
Two approaches
1. tcpdump utility to record all TCP/network segments.
Sequence of application messages inferred by analyzing
the TCP stream (Singh & Subhlok, CCN 2002)
2. can also be done by instrumenting/profiling calls to a
message passing library
– more precise but application not a black box (access to
source or ability to link to a profiler needed)
In practice both approaches give the “correct answer”
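The tcpdump-based approach can be sketched as a heuristic that coalesces captured TCP segments into application-level messages, splitting at idle gaps; the gap threshold and input format are assumptions for illustration, not details from Singh & Subhlok:

```python
def infer_messages(segments, gap_s=0.001):
    """Coalesce captured TCP segments into application messages.

    segments: (timestamp_s, payload_bytes) pairs for one connection,
    in capture order (e.g. parsed from a tcpdump trace).  A silence
    longer than gap_s is taken to separate two application messages.
    Returns the inferred message sizes in bytes.
    """
    messages = []
    current_size = 0
    last_t = None
    for t, size in segments:
        # A long enough gap means the previous message is complete.
        if last_t is not None and t - last_t > gap_s and current_size:
            messages.append(current_size)
            current_size = 0
        current_size += size
        last_t = t
    if current_size:
        messages.append(current_size)
    return messages
```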
JSSPP 2003, slide 13
Measurement and Modeling of CPU Activity
1. CPU status is measured at a fine grain with:
• a top-based program that probes the status of the CPU (busy or idle) at a fine grain (every 20 milliseconds)
• CPU utilization data from the Unix kernel over a specified interval of time
2. This provides the CPU busy and idle sequence for application execution at each node
3. The CPU busy time is divided into compute and communication time based on the time it takes to send application messages
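Step 3 can be sketched as follows; `busy_samples` stands for the 20 ms busy/idle sequence from step 2, and `comm_time` for the total message-transfer time derived from the communication log (the subtraction-based split is a simplifying assumption):

```python
def split_busy_time(busy_samples, comm_time, interval_s=0.02):
    """Split a node's CPU-busy time into compute and communication.

    busy_samples: booleans sampled every interval_s (20 ms), True
    when the CPU was busy.  comm_time: total time spent transferring
    application messages, derived from the message log.
    """
    busy = sum(busy_samples) * interval_s
    idle = (len(busy_samples) - sum(busy_samples)) * interval_s
    # Whatever busy time is not accounted for by message transfers
    # is attributed to computation.
    comm = min(comm_time, busy)
    compute = busy - comm
    return {"compute": compute, "comm": comm, "idle": idle}
```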
JSSPP 2003, slide 14
Validation
• Resource utilization of Class A/B, MPI, NAS
benchmarks measured on a dedicated testbed
• Sharing performance model developed for each
benchmark program
• Measured performance with competing loads and
limited bandwidth compared with estimates from
sharing performance model
– experiments on 500MHz Pentium Duos, 100 Mbps switched
network, TCP/IP, FreeBSD. dummynet employed to control
network bandwidth
– some new measurements for class B benchmarks on 1.7
GHz Pentium Duos with Linux (in the talk only)
JSSPP 2003, slide 15
Discovered Communication Structure of NAS Benchmarks
[Figure: communication graphs among four processes (0–3) for each benchmark: BT, CG, IS, LU, MG, SP, and EP]
JSSPP 2003, slide 16
CPU Behavior of NAS Benchmarks
[Bar chart: fraction of execution time (0–100%) spent in computation, communication, and idle for CG, IS, MG, SP, LU, BT, and EP]
JSSPP 2003, slide 17
Predicted and Measured Performance with One Shared CPU and/or Link (4 nodes)
[Bar chart: percentage slowdown with sharing (0–450%), predicted vs. measured, for a shared node, a shared link, and a shared node & link, for CG, MG, SP, LU, BT, and EP]
JSSPP 2003, slide 18
Predicted and Measured Performance with One Shared CPU
[Bar charts: percentage slowdown with sharing (predicted vs. measured) for a shared node, a shared link, and a shared node & link, for CG, MG, SP, LU, BT, and EP; and normalized execution time (predicted vs. measured) for CG, EP, IS, LU, MG, SP, and BT]
Results with one CPU load for the faster cluster / class B benchmarks
JSSPP 2003, slide 19
Predicted and Measured Performance with All Shared CPUs or Links
[Figures: two panels, shared CPUs and shared links, comparing measured performance with the predicted bounds]
• measured performance is generally within the bounds
• rather close to the upper bound in many cases…
JSSPP 2003, slide 20
Predicted and Measured Performance with All Shared Nodes on a Cluster
(new cluster: faster nodes, same network)
[Bar chart: normalized execution time (0–350) for CG, EP, IS, LU, MG, SP, and BT]
• new cluster results are closer to the lower bound
• speculation: the CPUs have more idle time, hence more flexibility to synchronize
JSSPP 2003, slide 21
Conclusions
• Applications respond differently to sharing; this is important for grid scheduling
• A sharing performance model can be built by non-intrusive execution monitoring of an application treated as a black box
• Major challenges:
– prediction related to the data set
– scalability to large systems?
– other limitations…
• Is the overall approach practical for large-scale grid computing?
JSSPP 2003, slide 22
Alternate Approach: Node Selection with Performance Skeletons (AMS 2003, tomorrow)
[Diagram: an application task graph (Data, Model, GUI, Pre, Stream, Sim 1) reduced to a small skeleton]
• Construct a skeleton for the application (a small program with the same execution behavior as the application)
• Select candidate node sets based on network status
• Execute the skeleton on them
• Select the node set with the best skeleton performance
JSSPP 2003, slide 23