PhD Course: Performance & Reliability Analysis of IP-based Communication Networks
Henrik Schiøler, Hans-Peter Schwefel, Mark Crovella, Søren Asmussen
• Day 1 Basics & Simple Queueing Models (HPS)
• Day 2 Traffic Measurements and Traffic Models (MC)
• Day 3 Advanced Queueing Models & Stochastic Control (HPS)
• Day 4 Network Models and Application (HS)
• Day 5 Simulation Techniques (HPS, SA)
• Day 6 Reliability Aspects (HPS)
Organized by HP Schwefel & H Schiøler
PHD Course: Performance & Reliability Analysis, Day 6 Part II, Spring04, Page 1, Hans Peter Schwefel

Content
1. Basic concepts
• Fault, failure, fault-tolerance, redundancy types
• Availability, reliability, dependability
• Life-times, hazard rates
2. Mathematical models
• Availability models, life-time models, repair models
• Performability
3. Availability and Network Management
• Rolling upgrade, planned downtime, etc.
4. Methods & Protocols for fault-tolerant systems
a) SW Fault-Tolerance
b) Network Availability
• IP Routing, HSRP/VRRP, SCTP
c) Server/Service Availability
• Server farms, cluster solutions, distributed reliability
Demonstration: Fault-tolerant call control systems

Motivation: Failure Types in Communication NWs
WANs:
• PSTNs
• IP networks: designed to be 'robust', but not highly available
• End-to-end availability
– Public Internet: below 99% (see e.g. http://www.netmon.com/IRI.asp).
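As a sanity check on these figures: steady-state availability converts directly into expected downtime per year. A small illustrative computation (the 365 x 24-hour year is the only assumption):

```python
def downtime_per_year(availability):
    """Expected downtime in hours per year at a given steady-state availability."""
    return (1.0 - availability) * 365 * 24

internet = downtime_per_year(0.99)    # public Internet: roughly 87.6 hours/year
pstn = downtime_per_year(0.9994)      # PSTN: roughly 5.3 hours/year
print(f"Internet: {internet:.1f} h/year, PSTN: {pstn:.1f} h/year")
```

The two-orders-of-magnitude-sounding difference in the 'nines' is thus about a factor of 17 in yearly outage time.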
– PSTN: 99.94%
• Operation, Administration, and Maintenance (OAM): one major source of outage times

Overview: Role of OAM
(Overview diagram: an OAM system around a high-availability network element; main functional areas and their mechanisms:)
• HA network design (overcome outages): network redundancy / fail-over, multi-homing stacks and protocols, IP fail-over, redundancy and diversity, state replication, message distribution
• Data HA: backup and restore, regional redundancy, rolling data upgrade
• Error handling (keep the system stable): detection and recovery, tracing and logging, escalation and alarming, operational modes
• Supervision & auditing: resource supervision, health/lifetime supervision, overload protection
• Fault management: repair and replacement, HA control & verification, system diagnosis
• Startup & shutdown: system startup, graceful shutdown
• Rolling upgrade: rolling upgrade, patch procedure, migration
Adapted from S. Uhle, Siemens ICM N

OAM Concepts I: Startup/Upgrade/Shutdown
• Startup
– Put components into a stable operational state
– Caution: potential for high load (synchronisation etc.) in the start-up phase
– Concept for start-up of larger sub-systems needed (e.g. mutual dependence)
• Upgrades
– Rolling upgrade
• Allows upgrades of single components while their replicates are in operation
• Consistency / data-compatibility problems
– Patch concept (SW)
• Instead of full SW re-installation, incremental changes
• Goal: zero or minimal outage of components
• Graceful shutdown
– Take components safely out of operation
– Possible steps:
• Stop accepting / redirect new tasks
• Finish existing tasks
• Synchronize data and isolate the component
• HW: hot plug/swap ability, e.g. interface cards
• Life testing

OAM Concepts II: Monitoring/Supervision
• Resource supervision and overload protection
– E.g. CPU load, queue lengths, traffic volumes
– Alarming: operator intervention, e.g. upgrades
– Signalling to reduce overload at the source
– Graceful degradation
• Logging & tracing
– For off-line analysis of incidents
– Correlation of traces for system analysis
– Problem: adequate granularity of logging data
• Service concepts/contracts
– Reaction to alarms
– System recovery modes
– Spare-parts handling
– Qualified technicians, availability (24/7?)
• High availability of the OAM system itself
– Redundancy (frequently separate OAM network(s))
– Prioritisation of OAM traffic
– Handling/storing of OAM/logging data
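The resource-supervision and overload-protection idea above (monitor a load metric, alarm, then shed new work for graceful degradation) can be sketched as follows; the class name and thresholds are illustrative assumptions, not from the slides:

```python
class ResourceSupervisor:
    """Illustrative sketch of resource supervision with overload protection:
    raise an alarm above a warning threshold, and reject new work
    (graceful degradation) above a critical threshold."""

    def __init__(self, warn=0.8, critical=0.95):
        self.warn = warn
        self.critical = critical
        self.alarms = []

    def admit(self, current_load):
        """Return True if a new task may be accepted at this load level."""
        if current_load >= self.critical:
            self.alarms.append(("CRITICAL", current_load))
            return False          # shed load: reject new tasks
        if current_load >= self.warn:
            self.alarms.append(("WARNING", current_load))
        return True

sup = ResourceSupervisor()
print(sup.admit(0.5), sup.admit(0.85), sup.admit(0.97))  # True True False
```

In a real system the alarm list would feed the escalation/alarming path of the OAM system rather than a Python list.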
General Approaches: Fault-tolerance
• Basic requirements for fault-tolerance
– Number of fault types and number of faults is bounded
– Existence of redundancy (structural: equal/diverse; functional; information; time redundancy/retries)
• Functional parts of fault-tolerant systems
– Fault detection (& diagnosis)
• Replication and comparison (if identical realisations: not suitable for design errors!)
• Inversion (e.g. mathematical functions)
• Acceptance tests (necessary conditions on the result, e.g. ranges)
• Timing behavior (time-outs)
– Fault isolation: prevent spreading
• Isolation of functional components, e.g. atomic actions, layering model
– Fault recovery
• Backward: retry, journalling (restart), checkpointing & rollback (non-trivial in distributed cooperating systems!)
• Forward: move to a consistent, acceptable, safe new state; but loss of result
• Compensation, e.g. TMR, FEC

Software fault-tolerance
• Mainly: design errors & user interaction (as opposed to production errors, wear-out, etc.)
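Compensation by TMR amounts to a majority vote over replica outputs; the sketch below (the function name is illustrative, not from the slides) masks a single faulty replica out of three:

```python
from collections import Counter

def majority_vote(results):
    """TMR-style compensation: return the majority value among replica
    outputs, masking a minority of faulty replicas; raise if no strict
    majority exists (the fault is then not maskable)."""
    value, count = Counter(results).most_common(1)[0]
    if 2 * count <= len(results):
        raise RuntimeError("no strict majority: fault not maskable")
    return value

print(majority_vote([42, 42, 41]))   # one faulty replica is masked -> 42
```

This also illustrates the slide's later caveat: the vote is trivial only for hashable, exactly comparable results; for complex result types, defining "equal enough" is the hard part.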
• Observations/estimates (experience in computing centers; the numbers are a bit old, however)
– 0.25...10 errors per 1000 lines of code
– Only about 30% of error reports by users accepted as errors by the vendor
– Reaction times (updates/patches): weeks to months
– Reliability not nearly as improved as for hardware errors; various reasons:
• Immensely increased software complexity/size
• Faster HW: more operations executed per time unit, hence a higher hazard rate
– However, relatively short down-times (as opposed to HW): ca. 25 min
• Approaches for fault-tolerance
– Software diversity (N-version programming, back-to-back tests of components)
• However, the same specification often leads to similar errors
• Forced diversity as improvement (?)
• Majority votes for complex result types not trivial

Network Connectivity: Basic Resilience
• Loss of network connectivity
– Link failures (cable cut)
– Router/component failures along the path
• Basic resilience features on (almost) every protocol layer
– L1+L2, link layer: FEC, link-layer retransmission, resilient switching (e.g. Resilient Packet Ring)
– L3, IP: dynamic routing
– L4, TCP/UDP: TCP retransmissions
– L5-7, application: application-level retransmissions, loss-resilient coding (e.g. VoIP, video)
• IP-layer network resilience: dynamic routing, e.g. OSPF
– 'Hello' packets used to determine adjacencies and link states
– Missing hello packets (typically 3) indicate outages of links or routers
– Link states propagated through link-state advertisements (LSAs)
– Updated link-state information (adjacencies) leads to modified path selection

Dynamic Routing: Improvements
• Drawbacks of dynamic routing
– Long duration until newly converged routing tables (30 sec up to several minutes)
– Rerouting not possible if the first router (gateway) fails
• Improvements
– Speed-up: pre-determined secondary paths (e.g. via MPLS)
– 'Local' router redundancy: Hot Standby Router Protocol (HSRP, RFC 2281) & Virtual Router Redundancy Protocol (VRRP, RFC 2338)
• Multiple routers on the same LAN
• Master performs packet routing
• Fail-over by migration of the 'virtual' MAC address
• Single IP address of the virtual router in the network, for client transparency
(Diagram: NE 1 and NE 2 attach via HUB 1/HUB 2 to Router 1 and Router 2, which together form one virtual router (HSRP/VRRP).)

Stream Control Transmission Protocol (SCTP)
• Defined in RFC 2960 (see also RFC 3257, 3286)
• Purpose initially: signalling transport
• Features
– Reliable, full-duplex unicast transport (performs retransmissions)
– TCP-friendly flow control (+ many other features of TCP)
– Multi-streaming, in-sequence delivery within streams: avoids head-of-line blocking (performance issue)
– Multi-homing: hosts with multiple IP addresses, path monitoring (heart-beat mechanism), transparent failover to secondary paths
• Useful for provisioning of network reliability
(Diagram: one SCTP association between Host A (IPa1, IPa2) and Host B (IPb1, IPb2) running over separate networks.)
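OSPF's hello mechanism and SCTP's path heart-beats use the same failure-detection rule: declare the peer (or path) down after a number of consecutive missed probes (typically 3 for OSPF hellos). A toy sketch of that rule, with illustrative names:

```python
class HeartbeatMonitor:
    """Declare a peer (or path) failed after `threshold` consecutive missed
    heart-beats, as in OSPF hello processing or SCTP path monitoring
    (illustrative sketch, not a protocol implementation)."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.missed = 0

    def on_heartbeat(self):
        self.missed = 0           # any reply resets the counter

    def on_timeout(self):
        self.missed += 1          # one probe interval elapsed without a reply

    @property
    def peer_up(self):
        return self.missed < self.threshold

mon = HeartbeatMonitor()
mon.on_timeout(); mon.on_timeout()
print(mon.peer_up)   # still considered up after 2 misses -> True
mon.on_timeout()
print(mon.peer_up)   # third consecutive miss -> False
```

The probe interval times the threshold bounds the detection delay, which is exactly why plain OSPF convergence is too slow for the fail-over times targeted later in this deck.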
Resilience for state-less applications
• State-less applications
– Server response only depends on the client request
– Principle in most IETF protocols, e.g. HTTP
– If at all, state stored in the client, e.g. 'cookies'
• Redundancy in 'server farms' (same LAN)
– Multiple servers
– IP address mapped to (replicated) 'load balancing' units
• HW load balancers
• SW load balancers
– Load balancers forward requests to a specific server MAC address according to a server-selection policy, e.g.
• Random selection
• Round-robin
• Shortest response time first
• Server failure
– The load balancer does not select the failed server any more
• Examples: web services, firewalls
(Diagram: client requests from the Internet pass a firewall/switch to a gateway node (with backup), which dispatches them to the service nodes.)

Cluster Framework
• SW layer across several servers within the same subnetwork
• Functionality
– Load balancer / dispatching functionality
– Node management / node supervision / node elimination
– Network connection: IP aliasing (unsolicited/gratuitous ARP reply); fail-over times of a few seconds (up to 10 minutes, if the router/switch does not support unsolicited ARP!)
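The server-selection policies above, together with the skipping of failed servers, can be sketched as follows (class and server names are illustrative assumptions, not part of any real load-balancer API):

```python
import itertools
import random

class LoadBalancer:
    """Illustrative server-farm load balancer: forwards each request to a
    healthy server according to a selection policy; failed servers are
    simply no longer selected."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.failed = set()
        self._rr = itertools.cycle(self.servers)

    def round_robin(self):
        # Advance the cycle, skipping failed servers (at most one full round).
        for _ in range(len(self.servers)):
            server = next(self._rr)
            if server not in self.failed:
                return server
        raise RuntimeError("all servers failed")

    def random_choice(self):
        healthy = [s for s in self.servers if s not in self.failed]
        if not healthy:
            raise RuntimeError("all servers failed")
        return random.choice(healthy)

lb = LoadBalancer(["srv1", "srv2", "srv3"])
lb.failed.add("srv2")                       # srv2 goes down
picks = [lb.round_robin() for _ in range(4)]
print(picks)                                # srv2 is never selected
```

"Shortest response time first" would work the same way, with the policy method choosing the healthy server with the smallest measured response time instead.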
• Single-image view (one IP address for the cluster)
– For communication partners
– For OAM
• Possible platform: server blades

Reliability middleware: Example RTP
• Software layer on top of the cluster framework
– All fault-tolerant operations are done within the cluster
– One virtual IP address for the whole cluster (a UDP dispatcher redirects incoming messages): "one system image"
– Transparency for the client
• Redundancy at the level of processes
– Each process is replicated among the servers of the cluster
– Only a faulty process is failed over
• Notion of node disappears
– Use of logical names (replicated processes can be in the same node)
• Middleware functionality (example: Reliable Telco Platform, FSC)
– Context management
– Reliable inter-process communication
– Load balancing for network traffic (UDP/SS7 dispatcher)
– Ticket manager
– Events/alarming
– Support of rolling upgrade
– Node manager
• Supervision of processes
• Communication of process status to other nodes
– Trace manager (debugging)
– Administrative framework (CLI, GUI, SNMP)

Example Architecture: Cluster middleware
(Diagram: a layered stack. Third-party applications and agents (IN, payment, Parlay/OSA, GSM HLR, management) sit on a Common Application Framework with basic Apps-APIs and single-image view; below it, the Component Framework / Reliant Telco Platform (RTP) with common basic functions (OPS/RAC, Signalware, Networker, Corba, ...) runs on a Cluster Framework (Primecluster, SUNcluster), which runs on OS (e.g. Solaris) and HW on each node.)

Distributed Redundancy: Reliable Server Pooling
• Distributed architecture
– Servers need only an IP address
– Entities offering the same service grouped into pools, accessible by a pool handle
– [State-sharing]
• Pool users (PU) contact servers (pool elements, PE) after receiving the response to a name-resolution request sent to a name server (NS)
• Fail-over
– Name server monitors the PEs
– Messages for dynamic registration and de-registration
– Flat name space
• Architecture and pool-access protocols (ASAP, ENRP) defined in the IETF RSerPool WG
• Failure detection and fail-over performed in the ASAP layer of the client
(ASAP = Aggregate Server Access Protocol, ENRP = Endpoint Name Resolution Protocol)

RSerPool: more details
• RSerPool scenario
– Each PE contains a full implementation of the functionality, no distributed sub-processes (different to RTP): reduced granularity for possible load balancing
• RSerPool name space
– Flat name space: easier management, better performance (no recursive requests)
– All name servers in an operational scope hold the same info (about all pools in the operational scope)
• Load balancing
– Load factors sent by the PE to the name server (initiated by the PE)
– In resolution requests, the name server communicates load factors to the PU
– The PU can use load factors in selection policies

RSerPool Protocols: Functionality (current status in IETF)
• ENRP
+ State-sharing between name servers
• ASAP
+ (De-)registration of pool elements (PE, name server)
+ Supervision of pool elements by a name server
+ Name resolution (PU, name server)
+ PE selection according to policy (& load factors) (PU)
+ Failure detection based on the transport layer (e.g. SCTP timeout)
+ Support of fail-over to another pool element (PU-PE)
+ Business cards (pool-to-pool communication)
+ Last will
+ Simple cookie mechanism (usable for pool-element state information) (PU-PE)
+ Under discussion: application-layer acknowledgements (PU-PE)

RSerPool and ServerBlades
• Server blades
– Many processor cards in one 19'' rack: space efficiency
– Integrated switch (possibly duplicated) for external communication
– Backplane for internal communication (duplicated)
– Management support: automatic configuration & code replication
• Combination: RSerPool on server blades
– No additional load-balancing SW necessary (but less granularity)
– Works on any heterogeneous system without additional implementation effort (e.g. server blades + Sun workstations)
– Standardized protocol stacks, no implementation effort in clients (except for use of the RSerPool API)

References/Acknowledgments
• M. Bozinovski, T. Renier, K. Larsen: 'Reliable Call Control', project reports and presentations, CPK/CTIF and Siemens ICM N, 2001-2004.
• Siemens ICM N, S. Uhle and colleagues.
• Fujitsu Siemens Computers (FSC), 'Reliable Telco Platform', slides.
• E. Jessen, 'Dependability and Fault-tolerance of Computing Systems', lecture notes (German), TU Munich, 1996.

Summary
1. Basic concepts
• Fault, failure, fault-tolerance, redundancy types
• Availability, reliability, dependability
• Life-times, hazard rates
2. Mathematical models
• Availability models, life-time models, repair models
• Performability
3. Availability and Network Management
• Rolling upgrade, planned downtime, etc.
4.
Methods & Protocols for fault-tolerant systems
a) SW Fault-Tolerance
b) Network Availability
• IP Routing, HSRP/VRRP, SCTP
c) Server/Service Availability
• Server farms, cluster solutions, distributed reliability
Demonstration: Fault-tolerant call control systems

Course Overview & Feedback

PhD Course: Overview
• Day 1 Basics & Simple Queueing Models (HPS)
• Day 2 Traffic Measurements and Traffic Models (MC)
• Day 3 Advanced Queueing Models & Stochastic Control (HPS)
• Day 4 Network Models and Application (HS)
• Day 5 Simulation Techniques (SA)
• Day 6 Reliability Aspects (HPS, HS)

Day 1: Basics & Simple Queueing Models
1. Intro & review of basic stochastic concepts
• Random variables, exponential distributions, stochastic processes, Poisson process & CLT
• Markov processes: discrete time, continuous time, Chapman-Kolmogorov eqs., steady-state solutions
2. Simple queueing models
• Kendall notation
• M/M/1 queues, birth-death processes
• Multiple servers, load dependence, service disciplines
• Transient analysis
• Priority queues
3. Simple analytic models for bursty traffic
• Bulk-arrival queues
• Markov modulated Poisson processes (MMPPs)
4. MMPP/M/1 queues and quasi-birth-death processes

Day 2: Traffic Measurements and Models (Mark)
• Morning: performance evaluation
– Marginals, heavy tails
– Autocorrelation
– Superposition of flows, alpha and beta traffic
– Long-range dependence
• Afternoon: network engineering
– Single-link analysis
• Spectral analysis, wavelets, forecasting, anomaly detection
– Multiple links
• Principal component analysis

Day 3: Advanced Queueing Models
1. Matrix-exponential (ME) distributions
• Definition & properties: moments, distribution function
• Examples
• Truncated power-tail distributions
2. Queueing systems with ME distributions
• M/ME/1//N and M/ME/1 queue
• ME/M/1 queue
3. Performance impact of long-range-dependent traffic
• N-Burst model, steady-state performance analysis
4. Transient analysis
• Parameters, first-passage-time analysis (M/ME/1)
5. Stochastic control: Markov decision processes
• Definition
• Solution approaches
• Example

Day 4: Network Models (Henrik)
1. Queueing networks
• Jackson networks
• Flow-balance equations, product-form solution
2. Deterministic network calculus
• Arrival curves
• Service curves: concatenation, prioritization
• Queue lengths, output flow
• Loops
• Examples: leaky buckets

Day 5: Simulation Techniques (Hans, Soeren)
1. Basic concepts
• Principles of discrete-event simulation
• Simulation tools
2. Random number generation
3. Output analysis
• Terminating simulations
• Non-terminating / steady-state case
4. Rare-event simulation I: importance sampling
5. Rare-event simulation II: adaptive importance sampling

Day 6: Reliability Aspects
1. Basic concepts
• Fault, failure, fault-tolerance, redundancy types
• Availability, reliability, dependability
• Life-times, hazard rates
2. Mathematical models
• Availability models, life-time models, repair models
• Performability models
3. Availability and Network Management
• Rolling upgrade, planned downtime, etc.
4. Methods & Protocols for fault-tolerant systems
a) SW Fault-Tolerance
b) Network Availability
• IP Routing, HSRP/VRRP, SCTP
c) Server/Service Availability
• Server farms, cluster solutions, distributed reliability
Demonstration: Fault-tolerant call control systems