Meta-Simulation Design and
Analysis for Large Scale
Networks
David W Bauer Jr
Department of Computer Science
Rensselaer Polytechnic Institute
OUTLINE
 Motivation
 Contributions
 Meta-simulation
 ROSS.Net
 BGP4-OSPFv2 Investigation
 Simulation
 Kernel Processes
 Seven O’clock Algorithm
 Conclusion
High-Level Motivation: to gain varying degrees of
qualitative and quantitative understanding of the
behavior of the system-under-test
“…objective as a quest for general invariant relationships
between network parameters and protocol dynamics…”
[Diagram: feature interactions, protocol stability, dynamics,
parameter sensitivity]
Meta-Simulation: capabilities to extract and interpret
meaningful performance data from the results of
multiple simulations
Challenges:
• individual experiment cost is high
• developing useful interpretations is difficult
• protocol performance modeling is complex
Experiment Design Goal: identify the minimum-cardinality set of
meta-metrics that maximally models the system
OUTLINE
 Motivation
 Contributions
 Meta-simulation
 ROSS.Net
 BGP4-OSPFv2 Investigation
 Simulation
 Kernel Processes
 Seven O’clock Algorithm
 Conclusion
Contributions: Meta-Simulation: OSPF
Problem: which meta-metrics are most important in
determining OSPF convergence?
Step 1: search the complete model space
Step 2: negligible metrics identified and isolated; re-parameterize
Step 3: re-scale
Our approach comes within 7% of Full Factorial using two orders of
magnitude fewer experiments:
– Optimization-based ED: 750 experiments
– Full-Factorial ED (FFED): 16,384 experiments
Contributions: Meta-Simulation: OSPF/BGP
Ability: model BGP and OSPF control plane
Problem: which meta-metrics are most important in
minimizing control plane dynamics (i.e., updates)?
All updates belong to one of four categories:
– OO: OSPF-caused OSPF update
– OB: OSPF-caused BGP update
– BO: BGP-caused OSPF update
– BB: BGP-caused BGP update
Results:
– OB: ~50% of total updates; BO: ~0.1% of total updates
– minimizing total OB+BO is 15-25% better than optimizing
other metrics
– meta-simulation perspective (complete view of all domains):
the global perspective is 20-25% better than local perspectives
Contributions: Simulation: Kernel Process
Parallel Discrete Event Simulation
Conservative Simulation
Wait until it is safe to process next
event, so that events are
processed in time-stamp order
Optimistic Simulation
Allow violations of time-stamp
order to occur, but detect
them and recover
Benefits of Optimistic Simulation:
i. not dependent on the simulated network topology
ii. as-fast-as-possible forward execution of events
Contributions: Simulation: Kernel Process
Problem: parallelizing simulation requires 1.5 to 2 times more
memory than sequential execution, and the additional memory
requirement hurts performance and scalability
[Plot: 4 processors used; scalability decreases as model size
increases, due to the increased memory required to support the model]
Solution: Kernel Processes (KPs), a new data structure that
supports parallelism and increases scalability
Contributions: Simulation: Seven O’clock
Problem: distributing simulation requires efficient global
synchronization
Inefficient solution: barrier synchronization between all nodes while
performing the computation
Efficient solution: pass messages between nodes, and synchronize in the
background of the main simulation
Seven O’clock Algorithm: eliminate message passing entirely, reducing
the cost from O(n) or O(log n) to O(1)
OUTLINE
 Motivation
 Contributions
 Meta-simulation
 ROSS.Net
 BGP4-OSPFv2 Investigation
 Simulation
 Kernel Processes
 Seven O’clock Algorithm
 Conclusion
ROSS.Net: Big Picture
Goal: an integrated simulation and experiment
design environment
[Diagram: ROSS.Net (simulation & meta-simulation) sits between
protocol modeling and protocol design:
– inputs: protocol parameters; measured topology data, traffic and
router stats, etc. from measurement data-sets (Rocketfuel)
– protocol models: OSPFv2, BGP4, TCP Reno, IPv4, etc.
– outputs: protocol metrics]
ROSS.Net: Big Picture
Meta-Simulation – ROSS.Net:
• Design of Experiments Tool (DOT) with Recursive Random Search
• experiment design
• statistical analysis
• optimization heuristic search
• sparse empirical modeling
Simulation – ROSS:
• optimistic parallel simulation
• memory-efficient network protocol models
[Diagram: DOT feeds input parameters to the parallel discrete event
network simulation, which returns output metric(s) to DOT]
ROSS.Net: Meta-Simulation Components
Design of Experiments Tool (DOT)
Traditional experiment design (full/fractional factorial) with
statistical or regression analysis (R, STRESS):
• parameter vector in, metric(s) out, yielding an empirical model
• small-scale systems
• linear parameter interactions
• small # of params
Optimization search with statistical or regression analysis
(R, STRESS):
• parameter vector in, metric(s) out, yielding a sparse empirical model
• large-scale systems
• non-linear parameter interactions
• large # of params – curse of dimensionality
Meta-Simulation: OSPF/BGP Interactions
• Router topology from Rocketfuel trace data
– took each ISP map as a single OSPF area
– created a BGP domain between ISP maps
– hierarchical mapping of routers
AT&T’s US Router Network Topology
• 8 levels of routers:
– Levels 0 and 1: 155 Mb/s, 4 ms delay
– Levels 2 and 3: 45 Mb/s, 4 ms delay
– Levels 4 and 5: 1.5 Mb/s, 10 ms delay
– Levels 6 and 7: 0.5 Mb/s, 10 ms delay
Meta-Simulation: OSPF/BGP Interactions
• OSPF
– intra-domain, link-state routing
– path costs matter
[Diagram: OSPF domain]
• Border Gateway Protocol (BGP)
– inter-domain, distance-vector, policy routing
– reachability matters
• BGP decision-making steps (a minimal comparator sketch follows
below):
– highest LOCAL_PREF
– lowest AS path length
– lowest origin type (0 – IGP, 1 – EGP, 2 – Incomplete)
– lowest MED
– lowest IGP cost
– lowest router ID
[Diagram: eBGP connectivity between domains, iBGP connectivity
within a domain]
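To make the tie-breaking order concrete, here is a minimal C sketch of
a route comparator. The bgp_route struct and its field names are
illustrative assumptions, not ROSS.Net’s actual data structures.

    #include <stdint.h>

    /* Hypothetical route attributes; names are illustrative only. */
    struct bgp_route {
        uint32_t local_pref;   /* higher is preferred              */
        uint32_t as_path_len;  /* lower is preferred               */
        uint8_t  origin;       /* 0 = IGP, 1 = EGP, 2 = Incomplete */
        uint32_t med;          /* lower is preferred               */
        uint32_t igp_cost;     /* lower is preferred               */
        uint32_t router_id;    /* lower wins the final tie-break   */
    };

    /* Return the preferred route, applying each decision step in order. */
    const struct bgp_route *
    bgp_prefer(const struct bgp_route *a, const struct bgp_route *b)
    {
        if (a->local_pref  != b->local_pref)
            return a->local_pref > b->local_pref ? a : b;
        if (a->as_path_len != b->as_path_len)
            return a->as_path_len < b->as_path_len ? a : b;
        if (a->origin != b->origin)
            return a->origin < b->origin ? a : b;
        if (a->med != b->med)
            return a->med < b->med ? a : b;
        if (a->igp_cost != b->igp_cost)
            return a->igp_cost < b->igp_cost ? a : b;
        return a->router_id < b->router_id ? a : b;
    }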
Meta-Simulation: OSPF/BGP Interactions
• Intra-domain routing decisions can affect inter-domain behavior,
and vice versa.
• All updates belong to one of four categories:
– OSPF-caused OSPF (OO) update
– OSPF-caused BGP (OB) update – interaction
– BGP-caused OSPF (BO) update – interaction
– BGP-caused BGP (BB) update
[Diagram: an OB update propagating toward a destination after a link
failure or cost increase (e.g. maintenance)]
Meta-Simulation: OSPF/BGP Interactions
Intra-domain routing decisions can affect inter-domain behavior, and
vice versa.
Identified four categories of updates:
– OO: OSPF-caused OSPF update
– BB: BGP-caused BGP update
– OB: OSPF-caused BGP update – interaction
– BO: BGP-caused OSPF update – interaction
[Diagram: a BO update propagating toward a destination when eBGP
connectivity becomes available]
These interactions cause route changes to thousands of IP prefixes,
i.e. huge traffic shifts!
Meta-Simulation: OSPF/BGP Interactions
• Three classes of protocol parameters:
– OSPF timers, BGP timers, BGP decision
• Maximum search space size: 14,348,907 experiments
• RRS was allowed 200 trials to optimize (minimize) each response
surface:
– OO, OB, BO, BB, OB+BO, ALL updates
• Applied multiple linear regression analysis to the results
(a toy RRS sketch follows below)
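For flavor, here is a toy C sketch of the recursive-random-search idea:
sample uniformly, then re-center and shrink the sample box around the
best point found so far. The shrink factor, bounds, and the
run_experiment stand-in are assumptions, not the actual RRS heuristic
used by ROSS.Net (where each trial is a full simulation run).

    #include <stdio.h>
    #include <stdlib.h>

    #define DIM 3  /* illustrative: far fewer than the real parameter count */

    /* Stand-in response surface to minimize (e.g. total OB+BO updates). */
    static double run_experiment(const double x[DIM])
    {
        double s = 0.0;
        for (int i = 0; i < DIM; i++)
            s += (x[i] - 0.3) * (x[i] - 0.3);
        return s;
    }

    static double rand01(void) { return rand() / (double)RAND_MAX; }

    int main(void)
    {
        double lo[DIM] = {0.0, 0.0, 0.0}, hi[DIM] = {1.0, 1.0, 1.0};
        double best = 1e300;

        for (int t = 0; t < 200; t++) {          /* 200 trials, as above */
            double x[DIM], y;
            for (int i = 0; i < DIM; i++)
                x[i] = lo[i] + (hi[i] - lo[i]) * rand01();
            y = run_experiment(x);
            if (y < best) {                      /* re-center and shrink */
                best = y;
                for (int i = 0; i < DIM; i++) {
                    double half = 0.25 * (hi[i] - lo[i]);
                    lo[i] = x[i] - half > 0.0 ? x[i] - half : 0.0;
                    hi[i] = x[i] + half < 1.0 ? x[i] + half : 1.0;
                }
            }
        }
        printf("best response found: %.6f\n", best);
        return 0;
    }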
Meta-Simulation: OSPF/BGP Interactions
Optimized with respect to the OB+BO response surface:
• ~15% improvement when BGP timers are included in the search space
of the optimal response.
• BGP timers play the major role, i.e. ~15% improvement.
– The BGP KeepAlive timer seems to be the dominant parameter – in
contrast to the expectation that MRAI would dominate!
• OSPF timers affect little, i.e. at most 5%.
– Low time-scale OSPF updates do not affect BGP.
Meta-Simulation: OSPF/BGP Interactions
• Varied response surfaces – each equivalent to a particular
management approach.
• Importance of parameters differs for each metric.
• For minimal total updates:
– local perspectives are 20-25% worse than the global perspective.
• For minimal total interactions, minimize total OB+BO
(15-25% better than other metrics; important to optimize OSPF):
– 15-25% worse results can occur with other metrics.
– OB updates are more important than BO updates
(~50% vs. ~0.1% of total updates).
Meta-Simulation
Conclusions:
– The number of experiments was reduced by an
order of magnitude in comparison to Full
Factorial.
– Experiment design and statistical analysis
enabled rapid elimination of insignificant
parameters.
– Several qualitative statements and system
characterizations could be obtained with few
experiments.
OUTLINE
 Problem Statement
 Contributions
 Meta-simulation
 ROSS.Net
 BGP4-OSPFv2 Investigation
 Simulation
 Kernel Processes
 Seven O’clock Algorithm
 Conclusion
Simulation: Overview
Parallel Discrete Event Simulation
– a Logical Process (LP) for each relatively parallelizable
simulation model, e.g. a router or a TCP host
Local Causality Constraint (LCC): events within each LP must be
processed in time-stamp order
Observation: adherence to the LCC is sufficient to ensure that a
parallel simulation produces the same result as a sequential
simulation (see the event-loop sketch after the comparison below)
Conservative Simulation
– avoid violating the local causality constraint
(wait until it is safe)
I. Null Message deadlock-avoidance algorithm
(Chandy/Misra/Bryant)
II. Time-stamp of next event
Optimistic Simulation
– allow violations of local causality to occur, but detect them
and recover using a rollback mechanism
I. Time Warp protocol (Jefferson, 1985)
II. Limiting the amount of optimistic execution
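To make the LCC concrete, here is a minimal C sketch of a single LP
consuming its pending events in time-stamp order. The structs and
helper names are illustrative, not ROSS’s scheduler.

    #include <stdio.h>
    #include <stdlib.h>

    struct event {
        double        recv_ts;  /* virtual time at which the event fires */
        struct event *next;     /* pending list, kept sorted by recv_ts  */
    };

    struct lp {
        double        lvt;      /* local virtual time        */
        struct event *pending;  /* sorted pending-event list */
    };

    /* Insert an event, keeping the list sorted by receive time-stamp. */
    static void lp_schedule(struct lp *lp, struct event *e)
    {
        struct event **p = &lp->pending;
        while (*p && (*p)->recv_ts <= e->recv_ts)
            p = &(*p)->next;
        e->next = *p;
        *p = e;
    }

    /* A straggler (an arrival in the LP's past) is exactly what a
     * conservative kernel blocks to avoid and an optimistic kernel
     * detects and rolls back; this sketch simply aborts. */
    static void lp_process_next(struct lp *lp)
    {
        struct event *e = lp->pending;
        if (!e)
            return;
        if (e->recv_ts < lp->lvt)
            abort();  /* LCC violation */
        lp->pending = e->next;
        lp->lvt = e->recv_ts;
        printf("processed event at virtual time %.2f\n", e->recv_ts);
        free(e);
    }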
ROSS: Rensselaer’s Optimistic Simulation System
[Diagram: GTW keeps a top-down global array PEState GState[NPE]; each
PEState holds an event queue, a cancel queue, free event lists and an
lplist[MAX_LP] of LPState entries (lp number, init/process/reverse/
final proc ptrs, processed-event queue head and tail). ROSS instead
links tw_event (receive_ts, src/dest_lp, user data), tw_lp (lp number,
type, pe, lp_list) and tw_pe (event queue, cancel queue, free event
list head/tail) structures directly by pointers.]
Example Accesses
GTW: top-down hierarchy
    lp_ptr = GState[LP[i].Map].lplist[LPNum[i]]
ROSS: bottom-up hierarchy
    lp_ptr = event->src_lp;
or
    pe_ptr = event->src_lp->pe;
Key advantages of the bottom-up approach (a minimal sketch follows
below):
• reduces access overheads
• improves locality and processor cache performance
Memory usage is only 1% more than sequential and independent of
LP count.
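A minimal C sketch of the bottom-up pointer hierarchy; the struct
layouts are simplified stand-ins for ROSS’s tw_event, tw_lp and tw_pe,
not their actual definitions.

    struct tw_pe;               /* processing element, one per CPU */

    struct tw_lp {
        long          id;       /* lp number        */
        struct tw_pe *pe;       /* owning processor */
    };

    struct tw_event {
        double        recv_ts;  /* receive time-stamp */
        struct tw_lp *src_lp;   /* sending LP          */
        struct tw_lp *dest_lp;  /* destination LP      */
    };

    /* Bottom-up access: one pointer dereference from the event in hand,
     * no global array indexing, so the hot path stays cache-friendly. */
    static struct tw_pe *owning_pe(const struct tw_event *e)
    {
        return e->src_lp->pe;
    }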
“On the Fly” Fossil Collection
OTFFC works by allocating only those events from the free list whose
time-stamps are less than GVT. As events are processed they are
immediately placed at the end of the free list.
[Diagram: snapshots of PE 0’s internal state (free lists FreeList[0]
and FreeList[1], LPs A, B and C) at time 15.0, and again after a
rollback of LP A and re-execution; processed events with time-stamps
5.0, 10.0 and 15.0 sit on the free lists.]
Key Observation: rollbacks cause the free list to become UNSORTED in
virtual time.
Result: event buffers that could be allocated are not, so the user
must over-allocate the free list. (A minimal allocation sketch
follows below.)
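A minimal C sketch of the OTFFC allocation discipline as described
above; the structs are simplified stand-ins, not ROSS’s event buffers.

    #include <stddef.h>

    struct event {
        double        recv_ts;
        struct event *next;
    };

    struct free_list {
        struct event *head;
        struct event *tail;
    };

    /* Allocate only if the head buffer is already fossil (time-stamp
     * below GVT, so it can never be rolled back). After a rollback the
     * list may be unsorted, so a usable buffer can hide behind a
     * non-fossil head and allocation fails: hence over-allocation. */
    static struct event *otffc_alloc(struct free_list *fl, double gvt)
    {
        struct event *e = fl->head;
        if (e == NULL || e->recv_ts >= gvt)
            return NULL;               /* head not yet fossil */
        fl->head = e->next;
        if (fl->head == NULL)
            fl->tail = NULL;
        return e;
    }

    /* Processed events go straight to the tail, with no sorting pass. */
    static void otffc_release(struct free_list *fl, struct event *e)
    {
        e->next = NULL;
        if (fl->tail != NULL)
            fl->tail->next = e;
        else
            fl->head = e;
        fl->tail = e;
    }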
Contributions: Simulation: Kernel Process
Fossil Collection / Rollback
[Diagram: each PE (one processing element per CPU utilized) owns a
few Kernel Processes (KPs), and each KP aggregates many Logical
Processes (LPs); fossil collection and rollback operate per KP rather
than per LP.]
ROSS: Kernel Processes
Advantages (see the sketch below):
i. significantly lowers fossil collection overheads
ii. lowers memory usage by aggregating LP statistics into KP
statistics
iii. retains the ability to process events on an LP-by-LP basis in the
forward computation
Disadvantages:
i. potential for “false rollbacks”
ii. care must be taken when deciding on how to map LPs to KPs
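A minimal C sketch of the aggregation idea, with simplified stand-ins
for ROSS’s structures: rollback and fossil-collection state lives in
the KP and is shared by its LPs.

    struct event { double recv_ts; struct event *next; };
    struct tw_lp { long id; };

    #define LPS_PER_KP 16

    struct tw_kp {
        double        lvt;             /* min virtual time over its LPs   */
        struct event *processed_head;  /* ONE processed-event list per KP */
        struct tw_lp *lps[LPS_PER_KP]; /* LPs aggregated under this KP    */
    };

    /* Fossil collection walks one list per KP instead of one per LP, so
     * its cost scales with the KP count, not the LP count. (Freeing of
     * the unlinked buffers is elided in this sketch.) */
    static void kp_fossil_collect(struct tw_kp *kp, double gvt)
    {
        while (kp->processed_head && kp->processed_head->recv_ts < gvt)
            kp->processed_head = kp->processed_head->next;
    }

The flip side is visible here too: rolling a KP back rewinds every LP
it aggregates, including LPs that did not causally need it, which is
the source of the “false rollbacks” above.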
ROSS: KP Efficiency
[Plot: a small trade-off – longer rollbacks vs. faster fossil
collection; efficiency drops only when there is not enough work in
the system]
ROSS: KP Performance Impact
[Plot: the number of KPs does not negatively impact performance]
ROSS: Performance vs GTW
[Plot: ROSS outperforms GTW 2:1 at best in parallel and 2:1 in
sequential execution]
Simulation: Seven O’clock GVT
Optimistic approach
– Relies on global virtual time (GVT) algorithm to perform fossil collection at
regular intervals
– Events with timestamp less than GVT:
• Will not be rolled back
• Can be freed
GVT calculation
– Synchronous algorithms: LPs stop event processing during the GVT
calculation
• the cost of synchronization may be higher than the positive work
done per interval
• processes waste time waiting
– Asynchronous algorithms: LPs continue processing events while the
GVT calculation proceeds in the background
* Goal: create a consistent cut among the LPs that divides events
into past and future at a point in wall-clock time
Two problems: (i) Transient Message Problem, (ii) Simultaneous
Reporting Problem (a minimal accounting sketch follows below)
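A minimal C sketch of why transient messages matter for the GVT bound;
the names and the global-snapshot framing are illustrative
assumptions, not a complete algorithm.

    #include <float.h>

    #define NPE 4

    double pe_lvt[NPE];              /* each PE's local virtual time     */
    double transient_min = DBL_MAX;  /* min time-stamp over messages sent
                                        but not yet received             */

    /* GVT must lower-bound every time-stamp the system can still emit,
     * including messages in flight; otherwise an event could be fossil
     * collected and then hit by a rollback. */
    static double compute_gvt(void)
    {
        double gvt = transient_min;
        for (int i = 0; i < NPE; i++)
            if (pe_lvt[i] < gvt)
                gvt = pe_lvt[i];
        return gvt;
    }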
Simulation: Mattern’s GVT
Construct the cut via message passing
Cost: O(log n) with a tree, O(n) with a ring
! With a large number of processors, the free pool can be exhausted
while waiting for the GVT computation to complete
Simulation: Fujimoto’s GVT
Construct the cut using a shared memory flag
Cost: O(1)
A sequentially consistent memory model ensures proper causal order
! Limited to shared memory architectures
Simulation: Memory Model
Sequentially consistent
does not mean
instantaneous
Memory events are only
guaranteed to be
causally ordered
Is there a method to achieve
sequentially consistent
shared memory in a loosely
coordinated, distributed
environment?
Simulation: Seven O’clock GVT
Key observations:
– An operation can occur atomically within a network of processors if
all processors observe that the event occurred at the same time.
– The CPU clock time scale (ns) is significantly smaller than the
network time scale (ms).
Network Atomic Operations (NAOs):
– an agreed-upon frequency in wall-clock time at which some event is
logically observed to have happened across a distributed system
– a subset of the possible operations provided by a complete
sequentially consistent memory model
(a clock-based sketch follows after the diagram below)
[Diagram: along wall-clock time, each processor alternates “Update
Tables” phases with “Compute GVT” phases at the agreed NAO frequency;
in the example cut across processors A-E, local virtual times of 5
and 7 and in-flight events at 7, 9 and 10 yield GVT = min(5, 7) = 5.]
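A minimal C sketch of the clock-based cut under the observations
above. The NAO period, the skew bound, and the helper names are
assumptions; the published algorithm additionally handles the
simultaneous-reporting and transient-message problems.

    #include <time.h>

    #define NAO_PERIOD_NS 100000000LL  /* agreed cut frequency: 100 ms */
    #define MAX_SKEW_NS     1000000LL  /* assumed bound on clock skew  */

    static long long wallclock_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }

    /* Every processor observes the same cut point with a local clock
     * read and no message exchange at all: O(1) cost, independent of
     * the processor count. */
    static int past_cut(long long cut_start_ns)
    {
        return wallclock_ns() >= cut_start_ns + NAO_PERIOD_NS + MAX_SKEW_NS;
    }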
Simulation: Seven O’clock GVT
• Itanium-2 Cluster
• r-PHOLD benchmark
• 1,000,000 LPs
• 10% remote events
• 16 start events
• 4 machines
– 1-4 CPUs each
– 1.3 GHz
• round-robin LP-to-PE mapping
[Plot: linear performance]
Simulation: Seven O’clock GVT
• Netfinity Cluster
• r-PHOLD benchmark
• 1,000,000 LPs
• 10% and 25% remote events
• 16 start events
• 36 nodes
– 2 CPUs each
– 800 MHz
Simulation: Seven O’clock GVT: TCP
• Itanium-2 Cluster
• 1,000,000 LPs
– each modeling a TCP host (i.e. one end of a TCP connection)
• 2 or 4 machines
– 1-4 CPUs each
– 1.3 GHz
• poorly mapped LP/KP/PE
[Plot: linear performance]
Simulation: Seven O’clock GVT: TCP
• Netfinity Cluster
• 1,000,000 LPs
– each modeling a TCP host (i.e. one end of a TCP connection)
• 4-36 machines
– 1-2 CPUs each
– Pentium III, 800 MHz
Simulation: Seven O’clock GVT: TCP
• Sith Itanium-2 cluster
• 1,000,000 LPs
– each modeling a TCP host (i.e. one end of a TCP connection)
• 4-36 machines
– 1-2 CPUs each
– 900 MHz
Simulation: Seven O’clock GVT
Summary
– Seven O’Clock Algorithm
• Clock-based algorithm for distributed processors
– creates a sequentially consistent view of distributed memory
• Zero-Cost Consistent Cut
– Highly scalable and independent of event memory limits
Algorithm      Cut Complexity     Parallel/    Global Invariant    Independent of
                                  Distributed                      Event Memory
Fujimoto's     O(1)               P            Shared Memory Flag  N
Seven O'Clock  O(1)               P & D        Clock               Y
                                               Synchronization
Mattern's      O(n) or O(log n)   P & D        Message Passing     N
                                               Interface
Samadi's       O(n) or O(log n)   P & D        Message Passing     N
                                               Interface
Summary: Contributions
Meta-simulation
 ROSS.Net: platform for large-scale network simulation,
experiment design and analysis
 OSPFv2 protocol performance analysis
 BGP4/OSPFv2 protocol interactions
Simulation
 kernel processes
 memory efficient, large-scale simulation
 Seven O’clock GVT Algorithm
 zero-cost consistent cut
 high performance distributed execution
Summary: Future Work
Meta-simulation
 ROSS.Net: platform for large-scale network simulation
 incorporate more realistic measurement data and protocol
models
 CAIDA, multi-cast, UDP, other TCP variants
 more complex experiment designs enable better qualitative
analysis
Simulation
 Seven O’clock GVT Algorithm
 compute FFT and analyze “power” of different models
 attempt to eliminate GVT algorithm by determining max rollback
length