Meta-Simulation Design and
Analysis for Large Scale
Networks
David W Bauer Jr
Department of Computer Science
Rensselaer Polytechnic Institute
OUTLINE
 Motivation
 Contributions
 Meta-simulation
 ROSS.Net
 BGP4-OSPFv2 Investigation
 Simulation
 Kernel Processes
 Seven O’clock Algorithm
 Conclusion
High-Level Motivation: to gain varying degrees of
qualitative and quantitative understanding of the
behavior of the system-under-test
“…objective as a quest for general invariant relationships
between network parameters and protocol dynamics…”
[Diagram: feature interactions, protocol stability, dynamics,
parameter sensitivity]
Meta-Simulation: capabilities to extract and interpret
meaningful performance data from the results of
multiple simulations
Challenges:
• individual experiment cost is high
• developing useful interpretations is difficult
• protocol performance modeling is complex
Experiment Design Goal: identify the minimum-cardinality set of
meta-metrics that maximally models the system
OUTLINE
 Motivation
 Contributions
 Meta-simulation
 ROSS.Net
 BGP4-OSPFv2 Investigation
 Simulation
 Kernel Processes
 Seven O’clock Algorithm
 Conclusion
Contributions: Meta-Simulation: OSPF
Problem: which meta-metrics are most important in
determining OSPF convergence?
Step 1: search the complete model space
Step 2: negligible metrics identified and isolated; re-parameterize
Step 3: re-scale
Our approach comes within 7% of Full Factorial using two orders of
magnitude fewer experiments:
– Optimization-based ED: 750 experiments
– Full-Factorial ED (FFED): 16,384 experiments
Contributions: Meta-Simulation: OSPF/BGP
Ability: model BGP and OSPF control plane
Problem: which meta-metrics are most important in
minimizing control plane dynamics (i.e., updates)?
All updates belong to one of four categories:
– OO: OSPF-caused OSPF update
– OB: OSPF-caused BGP update
– BO: BGP-caused OSPF update
– BB: BGP-caused BGP update
Results:
– OB: ~50% of total updates; BO: ~0.1% of total updates
– minimizing total OB+BO is 15-25% better than optimizing
other metrics
– meta-simulation perspective (complete view of all domains):
the global perspective is 20-25% better than local perspectives
Contributions: Simulation: Kernel Process
Parallel Discrete Event Simulation
Conservative Simulation
Wait until it is safe to process next
event, so that events are
processed in time-stamp order
Optimistic Simulation
Allow violations of time-stamp
order to occur, but detect
them and recover
Benefits of Optimistic Simulation:
i. not dependent on the simulated network topology
ii. as-fast-as-possible forward execution of events
Contributions: Simulation: Kernel Process
Problem: parallelizing simulation requires 1.5 to 2 times more
memory than sequential execution, and the additional memory
requirement hurts performance and scalability
[Plot: 4 processors used; scalability decreases as model size
increases, due to the increased memory required to support the model]
Solution: Kernel Processes (KPs), a new data structure that
supports parallelism and increases scalability
Contributions: Simulation: Seven O’clock
Problem: distributing simulation requires efficient global
synchronization
Inefficient solution: barrier synchronization between all nodes while
performing the computation
Efficient solution: pass messages between nodes, and synchronize in the
background of the main simulation
Seven O’clock Algorithm: eliminate message passing entirely, reducing
the cost from O(n) or O(log n) to O(1)
OUTLINE
 Motivation
 Contributions
 Meta-simulation
 ROSS.Net
 BGP4-OSPFv2 Investigation
 Simulation
 Kernel Processes
 Seven O’clock Algorithm
 Conclusion
ROSS.Net: Big Picture
Goal: an integrated simulation and experiment
design environment
[Diagram: ROSS.Net (simulation & meta-simulation) sits between
protocol modeling and protocol design:
– inputs: protocol parameters; measured topology data, traffic and
router stats, etc. from measurement data-sets (Rocketfuel)
– protocol models: OSPFv2, BGP4, TCP Reno, IPv4, etc.
– outputs: protocol metrics]
ROSS.Net: Big Picture
Meta-Simulation – ROSS.Net:
• Design of Experiments Tool (DOT) with Recursive Random Search
• experiment design
• statistical analysis
• optimization heuristic search
• sparse empirical modeling
Simulation – ROSS:
• optimistic parallel simulation
• memory-efficient network protocol models
[Diagram: DOT feeds input parameters to the parallel discrete event
network simulation, which returns output metric(s) to DOT]
ROSS.Net: Meta-Simulation Components
Design of Experiments Tool (DOT)
Traditional experiment design (full/fractional factorial) with
statistical or regression analysis (R, STRESS):
• parameter vector in, metric(s) out, yielding an empirical model
• small-scale systems
• linear parameter interactions
• small # of params
Optimization search with statistical or regression analysis
(R, STRESS):
• parameter vector in, metric(s) out, yielding a sparse empirical model
• large-scale systems
• non-linear parameter interactions
• large # of params – curse of dimensionality
Meta-Simulation: OSPF/BGP Interactions
• Router topology from Rocketfuel trace data
– took each ISP map as a single OSPF area
– created a BGP domain between ISP maps
– hierarchical mapping of routers
AT&T’s US Router Network Topology
• 8 levels of routers:
– Levels 0 and 1: 155 Mb/s, 4 ms delay
– Levels 2 and 3: 45 Mb/s, 4 ms delay
– Levels 4 and 5: 1.5 Mb/s, 10 ms delay
– Levels 6 and 7: 0.5 Mb/s, 10 ms delay
Meta-Simulation: OSPF/BGP Interactions
• OSPF
– intra-domain, link-state routing
– path costs matter
[Diagram: OSPF domain]
• Border Gateway Protocol (BGP)
– inter-domain, distance-vector, policy routing
– reachability matters
• BGP decision-making steps (a minimal comparator sketch follows
below):
– highest LOCAL_PREF
– lowest AS path length
– lowest origin type (0 – IGP, 1 – EGP, 2 – Incomplete)
– lowest MED
– lowest IGP cost
– lowest router ID
[Diagram: eBGP connectivity between domains, iBGP connectivity
within a domain]
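To make the tie-breaking order concrete, here is a minimal C sketch of
a route comparator. The bgp_route struct and its field names are
illustrative assumptions, not ROSS.Net’s actual data structures.

    #include <stdint.h>

    /* Hypothetical route attributes; names are illustrative only. */
    struct bgp_route {
        uint32_t local_pref;   /* higher is preferred              */
        uint32_t as_path_len;  /* lower is preferred               */
        uint8_t  origin;       /* 0 = IGP, 1 = EGP, 2 = Incomplete */
        uint32_t med;          /* lower is preferred               */
        uint32_t igp_cost;     /* lower is preferred               */
        uint32_t router_id;    /* lower wins the final tie-break   */
    };

    /* Return the preferred route, applying each decision step in order. */
    const struct bgp_route *
    bgp_prefer(const struct bgp_route *a, const struct bgp_route *b)
    {
        if (a->local_pref  != b->local_pref)
            return a->local_pref > b->local_pref ? a : b;
        if (a->as_path_len != b->as_path_len)
            return a->as_path_len < b->as_path_len ? a : b;
        if (a->origin != b->origin)
            return a->origin < b->origin ? a : b;
        if (a->med != b->med)
            return a->med < b->med ? a : b;
        if (a->igp_cost != b->igp_cost)
            return a->igp_cost < b->igp_cost ? a : b;
        return a->router_id < b->router_id ? a : b;
    }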
Meta-Simulation: OSPF/BGP Interactions
• Intra-domain routing decisions can affect inter-domain behavior,
and vice versa.
• All updates belong to one of four categories:
– OSPF-caused OSPF (OO) update
– OSPF-caused BGP (OB) update – interaction
– BGP-caused OSPF (BO) update – interaction
– BGP-caused BGP (BB) update
[Diagram: an OB update propagating toward a destination after a link
failure or cost increase (e.g. maintenance)]
Meta-Simulation: OSPF/BGP Interactions
Intra-domain routing decisions can affect inter-domain behavior, and
vice versa.
Identified four categories of updates:
– OO: OSPF-caused OSPF update
– BB: BGP-caused BGP update
– OB: OSPF-caused BGP update – interaction
– BO: BGP-caused OSPF update – interaction
[Diagram: a BO update propagating toward a destination when eBGP
connectivity becomes available]
These interactions cause route changes to thousands of IP prefixes,
i.e. huge traffic shifts!
Meta-Simulation: OSPF/BGP Interactions
• Three classes of protocol parameters:
– OSPF timers, BGP timers, BGP decision
• Maximum search space size: 14,348,907 experiments
• RRS was allowed 200 trials to optimize (minimize) each response
surface:
– OO, OB, BO, BB, OB+BO, ALL updates
• Applied multiple linear regression analysis to the results
(a toy RRS sketch follows below)
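For flavor, here is a toy C sketch of the recursive-random-search idea:
sample uniformly, then re-center and shrink the sample box around the
best point found so far. The shrink factor, bounds, and the
run_experiment stand-in are assumptions, not the actual RRS heuristic
used by ROSS.Net (where each trial is a full simulation run).

    #include <stdio.h>
    #include <stdlib.h>

    #define DIM 3  /* illustrative: far fewer than the real parameter count */

    /* Stand-in response surface to minimize (e.g. total OB+BO updates). */
    static double run_experiment(const double x[DIM])
    {
        double s = 0.0;
        for (int i = 0; i < DIM; i++)
            s += (x[i] - 0.3) * (x[i] - 0.3);
        return s;
    }

    static double rand01(void) { return rand() / (double)RAND_MAX; }

    int main(void)
    {
        double lo[DIM] = {0.0, 0.0, 0.0}, hi[DIM] = {1.0, 1.0, 1.0};
        double best = 1e300;

        for (int t = 0; t < 200; t++) {          /* 200 trials, as above */
            double x[DIM], y;
            for (int i = 0; i < DIM; i++)
                x[i] = lo[i] + (hi[i] - lo[i]) * rand01();
            y = run_experiment(x);
            if (y < best) {                      /* re-center and shrink */
                best = y;
                for (int i = 0; i < DIM; i++) {
                    double half = 0.25 * (hi[i] - lo[i]);
                    lo[i] = x[i] - half > 0.0 ? x[i] - half : 0.0;
                    hi[i] = x[i] + half < 1.0 ? x[i] + half : 1.0;
                }
            }
        }
        printf("best response found: %.6f\n", best);
        return 0;
    }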
Meta-Simulation: OSPF/BGP Interactions
Optimized with respect to the OB+BO response surface:
• ~15% improvement when BGP timers are included in the search space
of the optimal response.
• BGP timers play the major role, i.e. ~15% improvement.
– The BGP KeepAlive timer seems to be the dominant parameter – in
contrast to the expectation that MRAI would dominate!
• OSPF timers affect little, i.e. at most 5%.
– Low time-scale OSPF updates do not affect BGP.
Meta-Simulation: OSPF/BGP Interactions
• Varied response surfaces – each equivalent to a particular
management approach.
• Importance of parameters differs for each metric.
• For minimal total updates:
– local perspectives are 20-25% worse than the global perspective.
• For minimal total interactions, minimize total OB+BO
(15-25% better than other metrics; important to optimize OSPF):
– 15-25% worse results can occur with other metrics.
– OB updates are more important than BO updates
(~50% vs. ~0.1% of total updates).
Meta-Simulation
Conclusions:
– The number of experiments was reduced by an
order of magnitude in comparison to Full
Factorial.
– Experiment design and statistical analysis
enabled rapid elimination of insignificant
parameters.
– Several qualitative statements and system
characterizations could be obtained with few
experiments.
OUTLINE
 Problem Statement
 Contributions
 Meta-simulation
 ROSS.Net
 BGP4-OSPFv2 Investigation
 Simulation
 Kernel Processes
 Seven O’clock Algorithm
 Conclusion
Simulation: Overview
Parallel Discrete Event Simulation
– a Logical Process (LP) for each relatively parallelizable
simulation model, e.g. a router or a TCP host
Local Causality Constraint (LCC): events within each LP must be
processed in time-stamp order
Observation: adherence to the LCC is sufficient to ensure that a
parallel simulation produces the same result as a sequential
simulation (see the event-loop sketch after the comparison below)
Conservative Simulation
– avoid violating the local causality constraint
(wait until it is safe)
I. Null Message deadlock-avoidance algorithm
(Chandy/Misra/Bryant)
II. Time-stamp of next event
Optimistic Simulation
– allow violations of local causality to occur, but detect them
and recover using a rollback mechanism
I. Time Warp protocol (Jefferson, 1985)
II. Limiting the amount of optimistic execution
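To make the LCC concrete, here is a minimal C sketch of a single LP
consuming its pending events in time-stamp order. The structs and
helper names are illustrative, not ROSS’s scheduler.

    #include <stdio.h>
    #include <stdlib.h>

    struct event {
        double        recv_ts;  /* virtual time at which the event fires */
        struct event *next;     /* pending list, kept sorted by recv_ts  */
    };

    struct lp {
        double        lvt;      /* local virtual time        */
        struct event *pending;  /* sorted pending-event list */
    };

    /* Insert an event, keeping the list sorted by receive time-stamp. */
    static void lp_schedule(struct lp *lp, struct event *e)
    {
        struct event **p = &lp->pending;
        while (*p && (*p)->recv_ts <= e->recv_ts)
            p = &(*p)->next;
        e->next = *p;
        *p = e;
    }

    /* A straggler (an arrival in the LP's past) is exactly what a
     * conservative kernel blocks to avoid and an optimistic kernel
     * detects and rolls back; this sketch simply aborts. */
    static void lp_process_next(struct lp *lp)
    {
        struct event *e = lp->pending;
        if (!e)
            return;
        if (e->recv_ts < lp->lvt)
            abort();  /* LCC violation */
        lp->pending = e->next;
        lp->lvt = e->recv_ts;
        printf("processed event at virtual time %.2f\n", e->recv_ts);
        free(e);
    }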
ROSS: Rensselaer’s Optimistic Simulation System
[Diagram: GTW keeps a top-down global array PEState GState[NPE]; each
PEState holds an event queue, a cancel queue, free event lists and an
lplist[MAX_LP] of LPState entries (lp number, init/process/reverse/
final proc ptrs, processed-event queue head and tail). ROSS instead
links tw_event (receive_ts, src/dest_lp, user data), tw_lp (lp number,
type, pe, lp_list) and tw_pe (event queue, cancel queue, free event
list head/tail) structures directly by pointers.]
Example Accesses
GTW: top-down hierarchy
    lp_ptr = GState[LP[i].Map].lplist[LPNum[i]]
ROSS: bottom-up hierarchy
    lp_ptr = event->src_lp;
or
    pe_ptr = event->src_lp->pe;
Key advantages of the bottom-up approach (a minimal sketch follows
below):
• reduces access overheads
• improves locality and processor cache performance
Memory usage is only 1% more than sequential and independent of
LP count.
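A minimal C sketch of the bottom-up pointer hierarchy; the struct
layouts are simplified stand-ins for ROSS’s tw_event, tw_lp and tw_pe,
not their actual definitions.

    struct tw_pe;               /* processing element, one per CPU */

    struct tw_lp {
        long          id;       /* lp number        */
        struct tw_pe *pe;       /* owning processor */
    };

    struct tw_event {
        double        recv_ts;  /* receive time-stamp */
        struct tw_lp *src_lp;   /* sending LP          */
        struct tw_lp *dest_lp;  /* destination LP      */
    };

    /* Bottom-up access: one pointer dereference from the event in hand,
     * no global array indexing, so the hot path stays cache-friendly. */
    static struct tw_pe *owning_pe(const struct tw_event *e)
    {
        return e->src_lp->pe;
    }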
“On the Fly” Fossil Collection
OTFFC works by allocating only those events from the free list whose
time-stamps are less than GVT. As events are processed they are
immediately placed at the end of the free list.
[Diagram: snapshots of PE 0’s internal state (free lists FreeList[0]
and FreeList[1], LPs A, B and C) at time 15.0, and again after a
rollback of LP A and re-execution; processed events with time-stamps
5.0, 10.0 and 15.0 sit on the free lists.]
Key Observation: rollbacks cause the free list to become UNSORTED in
virtual time.
Result: event buffers that could be allocated are not, so the user
must over-allocate the free list. (A minimal allocation sketch
follows below.)
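A minimal C sketch of the OTFFC allocation discipline as described
above; the structs are simplified stand-ins, not ROSS’s event buffers.

    #include <stddef.h>

    struct event {
        double        recv_ts;
        struct event *next;
    };

    struct free_list {
        struct event *head;
        struct event *tail;
    };

    /* Allocate only if the head buffer is already fossil (time-stamp
     * below GVT, so it can never be rolled back). After a rollback the
     * list may be unsorted, so a usable buffer can hide behind a
     * non-fossil head and allocation fails: hence over-allocation. */
    static struct event *otffc_alloc(struct free_list *fl, double gvt)
    {
        struct event *e = fl->head;
        if (e == NULL || e->recv_ts >= gvt)
            return NULL;               /* head not yet fossil */
        fl->head = e->next;
        if (fl->head == NULL)
            fl->tail = NULL;
        return e;
    }

    /* Processed events go straight to the tail, with no sorting pass. */
    static void otffc_release(struct free_list *fl, struct event *e)
    {
        e->next = NULL;
        if (fl->tail != NULL)
            fl->tail->next = e;
        else
            fl->head = e;
        fl->tail = e;
    }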
Contributions: Simulation: Kernel Process
Fossil Collection / Rollback
[Diagram: each PE (one processing element per CPU utilized) owns a
few Kernel Processes (KPs), and each KP aggregates many Logical
Processes (LPs); fossil collection and rollback operate per KP rather
than per LP.]
ROSS: Kernel Processes
Advantages (see the sketch below):
i. significantly lowers fossil collection overheads
ii. lowers memory usage by aggregating LP statistics into KP
statistics
iii. retains the ability to process events on an LP-by-LP basis in the
forward computation
Disadvantages:
i. potential for “false rollbacks”
ii. care must be taken when deciding on how to map LPs to KPs
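A minimal C sketch of the aggregation idea, with simplified stand-ins
for ROSS’s structures: rollback and fossil-collection state lives in
the KP and is shared by its LPs.

    struct event { double recv_ts; struct event *next; };
    struct tw_lp { long id; };

    #define LPS_PER_KP 16

    struct tw_kp {
        double        lvt;             /* min virtual time over its LPs   */
        struct event *processed_head;  /* ONE processed-event list per KP */
        struct tw_lp *lps[LPS_PER_KP]; /* LPs aggregated under this KP    */
    };

    /* Fossil collection walks one list per KP instead of one per LP, so
     * its cost scales with the KP count, not the LP count. (Freeing of
     * the unlinked buffers is elided in this sketch.) */
    static void kp_fossil_collect(struct tw_kp *kp, double gvt)
    {
        while (kp->processed_head && kp->processed_head->recv_ts < gvt)
            kp->processed_head = kp->processed_head->next;
    }

The flip side is visible here too: rolling a KP back rewinds every LP
it aggregates, including LPs that did not causally need it, which is
the source of the “false rollbacks” above.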
ROSS: KP Efficiency
[Plot: a small trade-off – longer rollbacks vs. faster fossil
collection; efficiency drops only when there is not enough work in
the system]
ROSS: KP Performance Impact
[Plot: the number of KPs does not negatively impact performance]
ROSS: Performance vs GTW
[Plot: ROSS outperforms GTW 2:1 at best in parallel and 2:1 in
sequential execution]
Simulation: Seven O’clock GVT
Optimistic approach
– Relies on global virtual time (GVT) algorithm to perform fossil collection at
regular intervals
– Events with timestamp less than GVT:
• Will not be rolled back
• Can be freed
GVT calculation
– Synchronous algorithms: LPs stop event processing during the GVT
calculation
• the cost of synchronization may be higher than the positive work
done per interval
• processes waste time waiting
– Asynchronous algorithms: LPs continue processing events while the
GVT calculation proceeds in the background
* Goal: create a consistent cut among the LPs that divides events
into past and future at a point in wall-clock time
Two problems: (i) Transient Message Problem, (ii) Simultaneous
Reporting Problem (a minimal accounting sketch follows below)
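A minimal C sketch of why transient messages matter for the GVT bound;
the names and the global-snapshot framing are illustrative
assumptions, not a complete algorithm.

    #include <float.h>

    #define NPE 4

    double pe_lvt[NPE];              /* each PE's local virtual time     */
    double transient_min = DBL_MAX;  /* min time-stamp over messages sent
                                        but not yet received             */

    /* GVT must lower-bound every time-stamp the system can still emit,
     * including messages in flight; otherwise an event could be fossil
     * collected and then hit by a rollback. */
    static double compute_gvt(void)
    {
        double gvt = transient_min;
        for (int i = 0; i < NPE; i++)
            if (pe_lvt[i] < gvt)
                gvt = pe_lvt[i];
        return gvt;
    }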
Simulation: Mattern’s GVT
Construct the cut via message passing
Cost: O(log n) with a tree, O(n) with a ring
! With a large number of processors, the free pool can be exhausted
while waiting for the GVT computation to complete
Simulation: Fujimoto’s GVT
Construct the cut using a shared memory flag
Cost: O(1)
A sequentially consistent memory model ensures proper causal order
! Limited to shared memory architectures
Simulation: Memory Model
Sequentially consistent
does not mean
instantaneous
Memory events are only
guaranteed to be
causally ordered
Is there a method to achieve
sequentially consistent
shared memory in a loosely
coordinated, distributed
environment?
Simulation: Seven O’clock GVT
Key observations:
– An operation can occur atomically within a network of processors if
all processors observe that the event occurred at the same time.
– The CPU clock time scale (ns) is significantly smaller than the
network time scale (ms).
Network Atomic Operations (NAOs):
– an agreed-upon frequency in wall-clock time at which some event is
logically observed to have happened across a distributed system
– a subset of the possible operations provided by a complete
sequentially consistent memory model
(a clock-based sketch follows after the diagram below)
[Diagram: along wall-clock time, each processor alternates “Update
Tables” phases with “Compute GVT” phases at the agreed NAO frequency;
in the example cut across processors A-E, local virtual times of 5
and 7 and in-flight events at 7, 9 and 10 yield GVT = min(5, 7) = 5.]
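A minimal C sketch of the clock-based cut under the observations
above. The NAO period, the skew bound, and the helper names are
assumptions; the published algorithm additionally handles the
simultaneous-reporting and transient-message problems.

    #include <time.h>

    #define NAO_PERIOD_NS 100000000LL  /* agreed cut frequency: 100 ms */
    #define MAX_SKEW_NS     1000000LL  /* assumed bound on clock skew  */

    static long long wallclock_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }

    /* Every processor observes the same cut point with a local clock
     * read and no message exchange at all: O(1) cost, independent of
     * the processor count. */
    static int past_cut(long long cut_start_ns)
    {
        return wallclock_ns() >= cut_start_ns + NAO_PERIOD_NS + MAX_SKEW_NS;
    }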
Simulation: Seven O’clock GVT
• Itanium-2 Cluster
• r-PHOLD benchmark
• 1,000,000 LPs
• 10% remote events
• 16 start events
• 4 machines
– 1-4 CPUs each
– 1.3 GHz
• round-robin LP-to-PE mapping
[Plot: linear performance]
Simulation: Seven O’clock GVT
• Netfinity Cluster
• r-PHOLD benchmark
• 1,000,000 LPs
• 10% and 25% remote events
• 16 start events
• 36 nodes
– 2 CPUs each
– 800 MHz
Simulation: Seven O’clock GVT: TCP
• Itanium-2 Cluster
• 1,000,000 LPs
– each modeling a TCP host (i.e. one end of a TCP connection)
• 2 or 4 machines
– 1-4 CPUs each
– 1.3 GHz
• poorly mapped LP/KP/PE
[Plot: linear performance]
Simulation: Seven O’clock GVT: TCP
• Netfinity Cluster
• 1,000,000 LPs
– each modeling a TCP host (i.e. one end of a TCP connection)
• 4-36 machines
– 1-2 CPUs each
– Pentium III, 800 MHz
Simulation: Seven O’clock GVT: TCP
• Sith Itanium-2 cluster
• 1,000,000 LPs
– each modeling a TCP host (i.e. one end of a TCP connection)
• 4-36 machines
– 1-2 CPUs each
– 900 MHz
Simulation: Seven O’clock GVT
Summary
– Seven O’Clock Algorithm
• Clock-based algorithm for distributed processors
– creates a sequentially consistent view of distributed memory
• Zero-Cost Consistent Cut
– Highly scalable and independent of event memory limits
Algorithm      Cut Complexity     Parallel/    Global Invariant    Independent of
                                  Distributed                      Event Memory
Fujimoto's     O(1)               P            Shared Memory Flag  N
Seven O'Clock  O(1)               P & D        Clock               Y
                                               Synchronization
Mattern's      O(n) or O(log n)   P & D        Message Passing     N
                                               Interface
Samadi's       O(n) or O(log n)   P & D        Message Passing     N
                                               Interface
Summary: Contributions
Meta-simulation
 ROSS.Net: platform for large-scale network simulation,
experiment design and analysis
 OSPFv2 protocol performance analysis
 BGP4/OSPFv2 protocol interactions
Simulation
 kernel processes
 memory efficient, large-scale simulation
 Seven O’clock GVT Algorithm
 zero-cost consistent cut
 high performance distributed execution
Summary: Future Work
Meta-simulation
 ROSS.Net: platform for large-scale network simulation
 incorporate more realistic measurement data and protocol
models
 CAIDA, multi-cast, UDP, other TCP variants
 more complex experiment designs enable better qualitative
analysis
Simulation
 Seven O’clock GVT Algorithm
 compute FFT and analyze “power” of different models
 attempt to eliminate GVT algorithm by determining max rollback
length