Performance Prediction for Simple CPU and Network Sharing
Shreenivasa Venkataramaiah
Jaspal Subhlok
University of Houston
LACSI Symposium 2002
LACSI 2002, slide 1
Distributed Applications on Networks:
Resource selection, Mapping, Adapting
[Figure: a distributed application's task graph (Data, Model, Pre, Sim 1, Sim 2, Stream, Vis) with a question mark over how to map the application onto a network.]
LACSI 2002, slide 2
Resource Selection Framework
[Figure: resource selection framework. Measured and forecast network conditions (current resource availability) feed a network model (related work focuses on building logical network maps); together with an application model, it is used to predict application performance under current network conditions (the subject of this paper), which drives resource selection and scheduling.]
LACSI 2002, slide 3
Building “Sharing Performance Model”
Sharing Performance Model: predicts application performance under a given availability of CPU and network resources.
1. Execute application on a controlled testbed
   – monitor CPU and network during execution
   [Figure: testbed of four CPUs connected through a router.]
2. Analyze to generate the sharing performance model
   • application resource needs determine performance
   • application treated as a black box
LACSI 2002, slide 4
Resource Shared Execution Principles
• Network Sharing
  – sharing changes the observed bandwidth and latency
  – the effective application-level latency and bandwidth determine the time to transfer a message (a worked form follows this slide)
• CPU Sharing
  – the scheduler attempts to give equal CPU time to all processes
  – a competing process is first awarded "idle time", then competes for an overall equal share of the CPU
LACSI 2002, slide 5
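
A worked form of the message-cost rule above (our reading of the slide; the linear model is an assumption, not a formula stated on the slide): for a message of s bytes over a link with effective application-level latency L seconds and bandwidth B bytes/second,

    transfer_time(s) ≈ L + s / B

This is the per-message cost used again in the prediction sketch after slide 11.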
CPU sharing (1 competing process)
[Figure: timelines of dedicated vs. CPU-shared execution, marking in each CPU time slice whether the application is using the CPU, the CPU is idle, or a competing process is using the CPU, and the corresponding application progress.]
If an application keeps the CPU 100% busy during dedicated execution, its execution time will double on sharing the CPU with a compute-intensive process.
LACSI 2002, slide 6
CPU sharing (1 competing process)
[Figure: dedicated vs. CPU-shared execution timelines for a mostly idle application and for a 50-100% busy one.]
If the CPU is mostly idle (less than 50% busy) for dedicated execution, execution time is unchanged with CPU sharing.
If the CPU is busy 50-100% of the time for dedicated execution, execution time increases by between 0 and 100%.
• the slowdown is predictable if the usage pattern is known (a sketch follows this slide)
LACSI 2002, slide 7
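
The rule on slides 6 and 7 is mechanical enough to state as code. A minimal sketch in Python (ours, not the authors'; the handling of more than one competing process is our extrapolation from the fair-share argument):

def shared_cpu_time(dedicated_time, busy_fraction, competing=1):
    """Predict execution time when the CPU is shared with
    `competing` compute-intensive processes (slides 6-7)."""
    demand = busy_fraction * dedicated_time  # CPU time the application needs
    fair_share = 1.0 / (competing + 1)       # share a fair scheduler guarantees it
    if busy_fraction <= fair_share:
        return dedicated_time                # demand fits within its share: no slowdown
    return demand / fair_share               # stretched until its demand is met

# One competing process: a 100% busy application doubles, a <=50%
# busy one is unchanged, and a 75% busy one slows down by 50%.
assert shared_cpu_time(10.0, 1.0) == 20.0
assert shared_cpu_time(10.0, 0.4) == 10.0
assert shared_cpu_time(10.0, 0.75) == 15.0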
Methodology for Building an Application's Sharing Performance Model
1. Execute the application on a controlled testbed and measure system-level activity
   – such as CPU and network usage
2. Analyze it to construct program-level activity
   – such as message exchanges and synchronization waits
3. Develop the sharing performance model by modeling execution under different sharing scenarios
• This paper is limited to predicting execution time with one shared node and/or link in a cluster
LACSI 2002, slide 8
Measurement and Modeling of Communication
1. The tcpdump utility records all TCP segments exchanged by the executing nodes.
2. The sequence of application messages is inferred by analyzing the TCP stream (Singh & Subhlok, CCN 2002); a simple version is sketched below.
The goal is to capture the size and sequence of application messages, such as MPI messages
– this can also be done by instrumenting/profiling
– more precise, but the application is no longer a black box (access to the source, or the ability to link to a profiler, is needed)
LACSI 2002, slide 9
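
The real reconstruction algorithm is the one in Singh & Subhlok (CCN 2002); the following is only an illustrative heuristic of the idea, assuming a pcap trace (e.g., captured with "tcpdump -w trace.pcap tcp") and the third-party dpkt library:

import dpkt

def messages_from_trace(pcap_path):
    """Coalesce consecutive TCP payload segments flowing in the
    same direction into one application-level message (e.g., an
    MPI message split across several segments)."""
    msgs, cur_dir, cur_size = [], None, 0
    with open(pcap_path, 'rb') as f:
        for ts, buf in dpkt.pcap.Reader(f):
            eth = dpkt.ethernet.Ethernet(buf)
            if not isinstance(eth.data, dpkt.ip.IP):
                continue                      # non-IP frame
            ip = eth.data
            if not isinstance(ip.data, dpkt.tcp.TCP):
                continue                      # non-TCP packet
            tcp = ip.data
            if not tcp.data:
                continue                      # pure ACK, no payload
            direction = (ip.src, tcp.sport, ip.dst, tcp.dport)
            if direction == cur_dir:
                cur_size += len(tcp.data)     # same message continues
            else:
                if cur_dir is not None:
                    msgs.append((cur_dir, cur_size))
                cur_dir, cur_size = direction, len(tcp.data)
    if cur_dir is not None:
        msgs.append((cur_dir, cur_size))      # flush the last message
    return msgs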
Measurement and modeling of CPU activity
1. CPU status is measured at a fine grain with
   • a top-based program that probes the CPU status (busy or idle) every 20 milliseconds
   • CPU utilization data from the Unix kernel over a specified interval of time
2. This provides the CPU busy and idle sequence for
application execution at each node
3. The CPU busy time is divided into compute and
communication time based on the time it takes to
send application messages
LACSI 2002, slide 10
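
The probe described above is top-based and ran on FreeBSD; as a rough illustration of the same idea (ours, assuming Linux and its /proc/stat counters rather than the authors' tool), sampling at the 20 ms granularity the slide mentions:

import time

def busy_idle_trace(interval=0.02, samples=500):
    """Record, every `interval` seconds, whether the CPU was
    mostly busy (True) or mostly idle (False) over that tick."""
    def read_stat():
        with open('/proc/stat') as f:
            fields = [int(x) for x in f.readline().split()[1:]]
        return sum(fields), fields[3] + fields[4]  # total, idle+iowait jiffies
    trace = []
    total_prev, idle_prev = read_stat()
    for _ in range(samples):
        time.sleep(interval)
        total, idle = read_stat()
        d_total, d_idle = total - total_prev, idle - idle_prev
        trace.append(d_total > 0 and (d_total - d_idle) / d_total > 0.5)
        total_prev, idle_prev = total, idle
    return trace

At 20 ms the kernel's jiffy counters are coarse (often 10 ms ticks), so this sketches the measurement loop rather than replacing the paper's tool.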
Prediction of Performance with Shared CPU
and Communication Link
• It is assumed that we know:
  – the load on the shared node
  – the expected latency and bandwidth on the shared link
• The execution time for every computation phase and the time to transfer every message can then be computed
  → an estimate of overall execution time (a sketch follows this slide)
LACSI 2002, slide 11
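
A minimal sketch of how the pieces combine for one node (ours; the ('compute', seconds, busy_fraction) / ('message', bytes) trace encoding is an assumed representation, and the sketch ignores cross-node synchronization waits, the complication raised on slide 17):

def stretch(seconds, busy, competing):
    # fair-share CPU rule from slides 6-7
    share = 1.0 / (competing + 1)
    return seconds if busy <= share else busy * seconds / share

def predict_runtime(trace, competing, latency, bandwidth):
    """Walk a node's activity trace: stretch each computation
    phase by the CPU-sharing rule, and charge each message its
    transfer time under the effective latency (seconds) and
    bandwidth (bytes/second) of the shared link."""
    total = 0.0
    for record in trace:
        if record[0] == 'compute':
            _, seconds, busy = record
            total += stretch(seconds, busy, competing)
        else:                                  # ('message', size_in_bytes)
            _, size = record
            total += latency + size / bandwidth
    return total

# Example: two compute phases around a 1 MB message, one competing
# process on the node, and a link throttled to 50 Mbps (6.25e6 B/s).
trace = [('compute', 2.0, 0.9), ('message', 1_000_000), ('compute', 1.0, 0.3)]
print(predict_runtime(trace, competing=1, latency=0.001, bandwidth=6.25e6))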
Validation
• Resource utilization of the Class A MPI NAS benchmarks measured on a dedicated testbed
• Sharing performance model developed for each
benchmark program
• Measured performance with competing loads and
limited bandwidth compared with estimates from
sharing performance model
(All measurements presented are on 500 MHz Pentium Duos, a 100 Mbps network, TCP/IP, and FreeBSD; dummynet was employed to control network bandwidth.)
LACSI 2002, slide 12
Discovered Communication Structure of NAS
Benchmarks
[Figure: discovered communication graphs, each over four processes (0-3), for BT, CG, IS, LU, MG, SP, and EP.]
LACSI 2002, slide 13
CPU Behavior of NAS Benchmarks
[Figure: stacked bars showing the split of CPU time (0-100%) into computation, communication, and idle for CG, IS, MG, SP, LU, BT, and EP.]
LACSI 2002, slide 14
Predicted and Measured Performance with Resource Sharing
[Figure: predicted vs. measured percentage slowdown with sharing (0-450%) for CG, MG, SP, LU, BT, and EP under three scenarios: shared node, shared link, and shared node & link.]
LACSI 2002, slide 15
Conclusions (2 more slides though)
• A sharing performance model can be built by non-intrusive execution monitoring of an application treated as a black box; it can estimate performance under simple sharing fairly well
• Major challenges
  – Predictions are tied to the data set – the hope is that resource selection may still be good even if the estimates are off
  – Prediction with traffic on all links and computation loads on all nodes?
  – Is the overall approach practical for large-scale grid computing?
LACSI 2002, slide 16
Sharing of resources on multiple nodes and links
• Impact of sharing can be estimated on individual
nodes…
• but impact on overall execution is difficult to model
because of the combination of
– synchronization waits with unbalanced execution
– independent scheduling (lack of gang scheduling)
(e.g., one node is ready to communicate but the other is
swapped out due to independent scheduling…)
Preliminary result: lack of gang scheduling has a modest overhead (~5-40%) for small clusters (up to ~20 processors), not an order-of-magnitude overhead
LACSI 2002, slide 17
Scalability of Shared Performance Models
i.e., is the whole idea of using network measurement tools and application information to make resource selection decisions practical?
• The jury is still out
• An alternative approach being studied is…
  – automatically build an execution skeleton → a short-running program that reflects the execution behavior of an application
  – the performance of the skeleton is a measure of full application performance – run it to estimate performance on a given network
LACSI 2002, slide 18