February 11th 2010, APAN 29 – perfSONAR Workshop
Jeff Boote, Assistant Director R&D, Internet2
perfSONAR Use Cases
Outline
• Motivation
• How it Should Work
• How it Probably Works
• Identifying Common Network Problems
• Use Cases
– Cisco Telepresence
– Georgetown International Campus
– USAtlas
– REDDnet
2 – 5/25/2017, © 2009 Internet2
Motivation
• Now that we have seen the purpose and makeup of the
perfSONAR infrastructure, it’s time to see what it can do in the
real world
• perfSONAR is used by network engineers to identify many types
of performance problems
– A Divide and Conquer strategy is necessary to isolate problems
– A structured methodology helps to eliminate duplicate or useless
steps
– perfSONAR works best when everyone participates; holes in
deployment lead to holes in the problem-solving phase
• The following sections will outline the proper deployment
strategy and describe some real world use cases
3 – 5/25/2017, © 2009 Internet2
How it Should Work
• To accurately and swiftly address network performance
problems, the following steps should be undertaken
– Identify the problem: if a user in one location is complaining
about performance to another location, gather as much information as
possible
• Is the problem uni-directional or bi-directional?
• Does the problem occur all the time, frequently, or rarely?
• Does the problem occur for only a specific application, many
applications, or only some applications?
• Is the problem reproducible on other machines?
– Gather information about the environment
• Hosts
• Network Path
• Configuration (where applicable)
• Resources available
4 – 5/25/2017, © 2009 Internet2
How it Should Work
• Cont.
– Methodically approach the problem
• Test using the same tool everywhere, gather results
• Before moving on to the next tool, did you gather everything of
value?
• Are the results consistent?
– After proceeding through all tools and approaches, form theories
• Can the problem be isolated to a specific resource or component?
• Can testing be performed to eliminate dead ends?
• Consider the following example:
– International path
– Problems noted
– We know the path
– We have tools available
5 – 5/25/2017, © 2009 Internet2
Scenario: Multi-domain International Path
6 – 5/25/2017, © 2009 Internet2
Desirable Case: Expected Performance
7 – 5/25/2017, © 2009 Internet2
Typical: Poor Performance … Somewhere
8 – 5/25/2017, © 2009 Internet2
Typical: Poor Performance … Somewhere
But where?
9 – 5/25/2017, © 2009 Internet2
Solution: Test Points + Regular Monitoring
10 – 5/25/2017, © 2009 Internet2
perfSONAR: Backbone and Exchanges
11 – 5/25/2017, © 2009 Internet2
perfSONAR: Regional Networks
12 – 5/25/2017, © 2009 Internet2
perfSONAR: Campus
13 – 5/25/2017, © 2009 Internet2
Path Decomposition – Isolate the Problem
Step by step: test
between points
14 – 5/25/2017, © 2009 Internet2
Path Decomposition – Isolate the Problem
1st Segment - no
problems found
15 – 5/25/2017, © 2009 Internet2
Path Decomposition – Isolate the Problem
2nd Segment – Problem
Identified …
16 – 5/25/2017, © 2009 Internet2
Path Decomposition – Isolate the Problem
2nd Segment – Problem
Identified … and fixed!
17 – 5/25/2017, © 2009 Internet2
Path Decomposition – Isolate the Problem
But end to end
performance still
poor
18 – 5/25/2017, © 2009 Internet2
Path Decomposition – Isolate the Problem
3rd Segment – No
problems
19 – 5/25/2017, © 2009 Internet2
Path Decomposition – Isolate the Problem
4th Segment – No
problems
20 – 5/25/2017, © 2009 Internet2
Path Decomposition – Isolate the Problem
5th Segment – Last
problem found …
21 – 5/25/2017, © 2009 Internet2
Path Decomposition – Isolate the Problem
5th Segment – Last
problem found …
and fixed!
22 – 5/25/2017, © 2009 Internet2
Lessons Learned
• Problem resolution requires proper tools
– Specialized to given task (e.g. Bandwidth, Latency)
– Widely available where the problems will be
• Isolating a problem is a well defined, multi-step
process
– Rigid set of steps – systematic approach to prevent
causing new problems
• Diagnostics, as well as regular monitoring, can
reveal true network performance
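The segment-by-segment isolation illustrated above can be expressed as a short script. The sketch below is illustrative only: the test-point names, the measured loss values, and the threshold are hypothetical stand-ins for results that regular OWAMP/BWCTL monitoring would supply.

```python
# Sketch of the segment-by-segment isolation shown above.
# The test points and loss numbers are hypothetical placeholders; in
# practice each value would come from an OWAMP or BWCTL test between
# the two perfSONAR hosts on that segment.

TEST_POINTS = ["campus-A", "regional-A", "backbone-1", "backbone-2",
               "exchange", "regional-B", "campus-B"]

# Hypothetical per-segment loss rates gathered from regular monitoring.
MEASURED_LOSS = {
    ("campus-A", "regional-A"): 0.0,
    ("regional-A", "backbone-1"): 0.002,   # problem segment
    ("backbone-1", "backbone-2"): 0.0,
    ("backbone-2", "exchange"): 0.0,
    ("exchange", "regional-B"): 0.0,
    ("regional-B", "campus-B"): 0.0004,    # second, smaller problem
}

LOSS_THRESHOLD = 1e-4   # even tiny loss matters for WAN TCP

def suspect_segments(points, measurements, threshold):
    """Walk adjacent pairs of test points and flag segments whose
    measured loss exceeds the threshold."""
    suspects = []
    for a, b in zip(points, points[1:]):
        loss = measurements[(a, b)]
        if loss > threshold:
            suspects.append((a, b, loss))
    return suspects

for a, b, loss in suspect_segments(TEST_POINTS, MEASURED_LOSS, LOSS_THRESHOLD):
    print(f"investigate {a} <-> {b}: loss {loss:.4%}")
```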
23 – 5/25/2017, © 2009 Internet2
How it Probably Works
• If the suggested steps aren’t taken (or are followed in an ad hoc
manner), results will vary.
– Skipping steps leads to missing clues
• Deployment and participation may vary; this leads to gaps
in the debugging process
• Consider the following example:
– International path
– Problems noted
– We know the path
– We have tools available – almost everywhere
24 – 5/25/2017, © 2009 Internet2
Scenario: Multi-domain International Path
25 – 5/25/2017, © 2009 Internet2
Desirable Case: Expected Performance
26 – 5/25/2017, © 2009 Internet2
Typical: Poor Performance … Somewhere
27 – 5/25/2017, © 2009 Internet2
Typical: Poor Performance … Somewhere
But where?
28 – 5/25/2017, © 2009 Internet2
Solution: Test Points + Regular Monitoring
29 – 5/25/2017, © 2009 Internet2
Solution: Test Points + Regular Monitoring
Key Point: End-to-end monitoring
requires participation from all
domains
30 – 5/25/2017, © 2009 Internet2
Typical: Poor Performance … Somewhere
Internet2 – Available on
the backbone
31 – 5/25/2017, © 2009 Internet2
Typical: Poor Performance … Somewhere
The Campus is
participating too
32 – 5/25/2017, © 2009 Internet2
Typical: Poor Performance … Somewhere
The exchange point
makes statistics available
33 – 5/25/2017, © 2009 Internet2
Typical: Poor Performance … Somewhere
A regional network may
not participate…
34 – 5/25/2017, © 2009 Internet2
Typical: Poor Performance … Somewhere
Complete end-to-end
monitoring is not
possible.
35 – 5/25/2017, © 2009 Internet2
Lessons Learned
• Missing part of the path leaves us at a huge
disadvantage
• We may discover some problems through isolation on
the path we know, but could miss something
– Most network problems occur at the demarcation
between networks
– Testing around the problem won’t work (we still
have to transit this network)
36 – 5/25/2017, © 2009 Internet2
Why is Science Data Movement Different?
• Different Requirements
– Campus network is not designed for large flows
• Enterprise requirements
• 100s of Mbits is common; any more is rare (or viewed as strange)
• Firewalls
• Network is designed to mitigate the risks, since the common hardware
(e.g. Desktops and Laptops) is untrusted
– Science is different
• Network needs to be robust and stable (e.g. predictable performance)
• 10s of Gbits of traffic (N.B. it’s probably not sustained – but could be)
• Sensitive to enterprise protections (e.g. firewalls, LAN design)
• Fixing is not easy
– Design the base network for science, attach the enterprise on the
side (expensive, time consuming, and good luck convincing your
campus this is necessary…)
– Mitigate the problems by moving your science equipment to the edge
• Try to bypass that firewall at all costs
• Get as close to the WAN connection as you can
37 – 5/25/2017, © 2009 Internet2
Identifying Common Network Problems
• The above examples paint a broad picture: there is a problem,
somewhere, that needs to be fixed
• What could be out there?
• Architecture
• Common Problems, e.g. “Soft Failures”
• Myths and Pitfalls
• Getting trapped is easy
• Following a bad lead is easy too
38 – 5/25/2017, © 2009 Internet2
Identifying Common Network Problems
• Audience Question: Would you complain if you knew what you
were getting was not correct?
• N.B. Actual performance between Vanderbilt University and
TACC – Should be about 1Gbps in both directions.
39 – 5/25/2017, © 2009 Internet2
Identifying Common Network Problems
• Internet2/ESnet engineers will help members and customers
debug problems if they are escalated to us
– Goal is to solve the entire problem – end to end
– Involves many parties (typical: End users as well as Campus,
Regional, Backbone staff)
– Slow process of locating and testing each segment in the path
– Have tools to make our job easier (more on this later)
• Common themes and patterns for almost every debugging
exercise emerge
– Architecture (e.g. LAN design, Equipment Choice, Firewalls)
– Configuration
– “Soft Failures”, e.g. something that doesn’t sever connectivity,
but makes the experience unpleasant
40 – 5/25/2017, © 2009 Internet2
Architectural Considerations
• LAN vs WAN Design
– Multiple Gbit flows [to the outside] should be close to the WAN
connection
– Reduce the number of hops/devices/physical wires that may slow
you down
– Great performance on the LAN != Great performance on the WAN
• You Get What you Pay For
– Cheap equipment will let you down
– Network
• Small Buffers, questionable performance (e.g. internal switching fabric
can’t keep up w/ LAN demand let alone WAN)
• Lack of diagnostic tools (SNMP, etc.)
– Storage
• Disk throughput needs to be high enough to get everything on to the
network
• Plunking a load of disk into an incapable server is not great either
– Bus performance
– Network Card(s)
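To make the “you get what you pay for” point concrete, the sketch below does the back-of-the-envelope arithmetic for a transfer host: the slowest of disk, bus, and network card sets the ceiling. All component numbers are assumptions for illustration, not measurements of any particular product.

```python
# Back-of-the-envelope check of whether a data-transfer host can feed its
# network card. All numbers below are illustrative assumptions, not
# measurements of any particular server.

GBIT = 1e9 / 8          # bytes per second in 1 Gbit/s

components = {
    "disk array (RAID read)": 400e6,        # 400 MB/s
    "PCIe slot hosting the NIC": 2.0e9,     # ~2 GB/s usable
    "10GE network card": 10 * GBIT,         # 1.25 GB/s
}

# The end-to-end ceiling is set by the slowest component in the chain.
bottleneck = min(components, key=components.get)
rate = components[bottleneck]
print(f"end-to-end ceiling: ~{rate * 8 / 1e9:.1f} Gbit/s, limited by {bottleneck}")
```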
41 – 5/25/2017, © 2009 Internet2
Architectural Considerations – cont.
• Firewalls
– Designed to stop traffic
• read this slowly a couple of times…
– Small buffers
• Concerned with protecting the network, not impacting your
performance
– Will be a lot slower than the original wire speed
– A “10G Firewall” may handle 1 flow close to 10G, doubtful that it
can handle a couple.
– If firewall-like functionality is a must – consider using router filters
instead
42 – 5/25/2017, © 2009 Internet2
Configuration
• Host Configuration
– Tune your hosts (especially compute/storage!)
– Changes to several parameters can yield 4 – 10X improvement
– Takes minutes to implement/test
– Instructions: http://fasterdata.es.net/tuning.html
• Network Switch/Router Configuration
– Out of the box configuration may include small buffers
– Competing Goals: Memory is expensive – and the amount of
interface buffer is not always clear from looking at hardware
configurations. Science flows need large buffers to push more
data into the network (TCP windows are dependent upon RTT and
bottleneck capacity – TCP is bursty).
– Read your manuals and test from a LAN host to a WAN host to verify (not
LAN to LAN); see the buffer-sizing sketch below.
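A minimal sketch of the sizing rule implied above, assuming example path figures (10 Gbit/s bottleneck, 80 ms WAN RTT): the bandwidth-delay product gives a rough lower bound for the TCP window, and for the socket and interface buffers that must absorb TCP’s bursts.

```python
# Rough TCP buffer sizing from the bandwidth-delay product mentioned above.
# The RTT and bottleneck capacity are example values, not measurements.

def required_window_bytes(bottleneck_bps, rtt_seconds):
    """A single TCP flow needs roughly capacity * RTT of window (and of
    socket/interface buffer) to keep the path full."""
    return bottleneck_bps * rtt_seconds / 8

# Example: 10 Gbit/s bottleneck, 80 ms WAN round-trip time.
window = required_window_bytes(10e9, 0.080)
print(f"~{window / 1e6:.0f} MB of window/buffer needed")   # ~100 MB

# The same flow on a 1 ms LAN needs only ~1.25 MB, which is why
# LAN-only testing hides undersized buffers.
print(f"LAN case: ~{required_window_bytes(10e9, 0.001) / 1e6:.2f} MB")
```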
43 – 5/25/2017, © 2009 Internet2
Configuration – cont.
• Host Configuration – spot when the settings were tweaked…
• N.B. Example taken from REDDnet (UMich to TACC, using
BWCTL measurement)
44 – 5/25/2017, © 2009 Internet2
Soft Failures
• Soft Failures are any network problem that does not result in a
loss of connectivity
– Slows down a connection
– Hard to diagnose and find
– May go unnoticed by LAN users in some cases, but remote users
may be the ones complaining (TCP recovers faster with smaller
RTTs)
• Caveat – How much time/energy do you put into listening to
complaints of remote users?
• Common:
– Dirty or Crimped Cables
– Failing Optics/Interfaces
– [Router] Process Switching, aka “Punting”
– Router Configuration (Buffers/Queues)
45 – 5/25/2017, © 2009 Internet2
Soft Failures – cont.
• Dirty or Crimped Cables and Failing Optics/Interfaces
– Throw off very low levels of loss – may not notice on a LAN, will
notice on the WAN
– Will be detected with passive tools (e.g. SNMP monitoring)
– Question: Would you fix it if you knew it was broken?
• [Router] Process Switching
– “Punt” traffic to a slow path
• Router Configuration (Buffers/Queues)
– Need to be large enough to handle science flows
– Routing table overflow (e.g. system crawls to a halt when memory
is exhausted)
46 – 5/25/2017, © 2009 Internet2
Soft Failures – cont.
• Identifying and Fixing should be done through the use of
monitoring and diagnostic tools
– Establish testing points on the network
• On the edge and in the center
– Test to WAN points to find hidden/hard to diagnose problems
– Where to Place and how to find?
– Have collaborators co-allocate a testing machine
– Use discovery tools to find them (e.g. perfSONAR)
– Use an array of tools for different characteristics
• Latency (One Way and Round Trip)
• Bandwidth
• Interface Utilization/Discards/Errors
• Active vs Passive Testing
47 – 5/25/2017, © 2009 Internet2
Myths and Pitfalls
• “My LAN performance is great, WAN is probably the same”
– TCP recovers from loss/congestion quickly on the LAN (low RTT)
– TCP will cut its speed in half for every loss/discard on the WAN – and will
take a long time to recover with a large RTT.
– Small levels of loss on the LAN (ex. 1/1000 packets) will go unnoticed,
will be very noticeable on the WAN.
• “Ping is not showing loss/latency differences”
– ICMP is not usually sent at the intensities of real data flows
– ICMP May be blocked/ignored by some sites
– Routers process ICMP differently than other packets (e.g. may show
phantom delay)
– ICMP may hide some (not all) loss.
– Will not show asymmetric routing delays (e.g. taking a different path
on send vs receive)
• Our goal is to dispel these and other myths by teaching the proper way to
verify a network – we have lots of tools at our disposal, but using
them in the appropriate order is necessary too
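To see why “small” loss is only a WAN problem, the sketch below applies the Mathis et al. rule of thumb for steady-state TCP throughput; the MSS, RTTs, and loss rate are illustrative values, not measurements from the slides.

```python
# Why small loss is invisible on the LAN but devastating on the WAN:
# the Mathis et al. rule of thumb for steady-state TCP Reno throughput,
# throughput <= (MSS / RTT) * (C / sqrt(p)), with C ~= 1.22.
# Values below are illustrative.

from math import sqrt

def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate, c=1.22):
    return (mss_bytes * 8 / rtt_s) * (c / sqrt(loss_rate))

loss = 1e-3          # "1 in 1000 packets" from the slide
for label, rtt in [("LAN, 0.5 ms", 0.0005), ("WAN, 80 ms", 0.080)]:
    bps = mathis_throughput_bps(1460, rtt, loss)
    print(f"{label}: ceiling ~{bps / 1e6:,.0f} Mbit/s")
# The LAN flow still roughly fills a gigabit link, while the WAN flow
# is capped at a few Mbit/s by the same loss rate.
```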
48 – 5/25/2017, © 2009 Internet2
Use Cases
• The following use cases demonstrate use of perfSONAR tools to
solve sometimes complex performance problems
– Cisco Telepresence
• Multi-domain path where performance guarantees dictate use of a
specific application
– Georgetown International Campus
• Assuring quality, from one end of the world to another
– USAtlas
• Enabling Big Science through diagnostic checks and regular
monitoring
– REDDnet
• Assuring clean paths for data movement
49 – 5/25/2017, © 2009 Internet2
Use Cases - Cisco
50 – 5/25/2017, © 2009 Internet2
Cisco TelePresence Demo
• 2 Locations
• Harvard University (Boston, MA)
• Spring Member Meeting (Arlington, VA)
• Must meet or exceed performance expectations
• < 10 ms Jitter (Packet Arrival Variation)
• < 160 ms End-to-End Delay
• < 0.05% Packet Loss
• Network Path spanned:
• ~450 Miles
• 4 Distinct Domains
• Internet2
• Mid Atlantic Crossroads (MAX)
• Northern Crossroads (NOX)
• Harvard University
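As a rough illustration of the jitter target above, the sketch below summarizes packet arrival variation from a series of one-way delay samples of the kind OWAMP produces; the delay values and the percentile-based summary are assumptions, not the exact metric used in the demo.

```python
# One simple way to summarize "jitter" (packet arrival variation) from a
# series of one-way delays, as OWAMP-style tests would report them.
# The delay samples below are made up for illustration.

delays_ms = [42.1, 42.3, 41.9, 42.0, 55.7, 42.2, 42.1, 43.0, 42.4, 42.0]

# Delay variation of each packet relative to the minimum observed delay.
base = min(delays_ms)
variation = sorted(d - base for d in delays_ms)

# Report a percentile rather than the worst case, so one outlier does not
# dominate the check against the < 10 ms target from the slide.
p95 = variation[int(0.95 * (len(variation) - 1))]
print(f"95th-percentile delay variation: {p95:.1f} ms "
      f"(target from the demo: < 10 ms)")
```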
51 – 5/25/2017, © 2009 Internet2
Demonstration Overview
[Diagram: demonstration path – Hotel, Mid-Atlantic Crossroads, Internet2 POP, Northern Crossroads, Harvard]
52 – 5/25/2017, © 2009 Internet2
Cisco TelePresence Demo
• Performance Monitoring
– Tools installed within each domain
• pS Performance Toolkit – Bootable CD with
performance tools installed and configured
– Interested in several ‘metrics’
• One Way Delay – OWAMP
• Network Utilization – SNMP
• Several Problems Found (And Corrected)
– Over-utilized Link
– Traffic Spikes from Cross Traffic
53 – 5/25/2017, © 2009 Internet2
Over-utilized Link
• Tools indicated high amounts of end-to-end Jitter:
• Goal: Isolate which segment (or segments) to
examine further.
54 – 5/25/2017, © 2009 Internet2
High Jitter – But Where?
[Diagram: end-to-end path – Hotel, Mid-Atlantic Crossroads, Internet2 POPs, Northern Crossroads, Harvard – jitter location unknown]
55 – 5/25/2017, © 2009 Internet2
Over-utilized Link
• Process:
– Tools are installed and available in each domain
– ‘Decompose’ the entire end-to-end path, and
examine the performance between testing points:
• Meeting Hotel to NOX
• Meeting Hotel to Internet2 (New York)
• Meeting Hotel to Internet2 (Washington)
• Meeting Hotel to MAX
56 – 5/25/2017, © 2009 Internet2
Over-utilized Link
• Meeting Hotel to NOX
57 – 5/25/2017, © 2009 Internet2
Still Seen on Shorter Path
[Diagram: same path; jitter still present on the shorter test segment]
58 – 5/25/2017, © 2009 Internet2
Over-utilized Link
• Meeting Hotel to Internet2 (New York)
59 – 5/25/2017, © 2009 Internet2
Still Seen on Shorter Path
[Diagram: same path; jitter still present on the shorter test segment]
60 – 5/25/2017, © 2009 Internet2
Over-utilized Link
• Meeting Hotel to Internet2 (Washington)
61 – 5/25/2017, © 2009 Internet2
Still Seen on Shorter Path
[Diagram: same path; jitter still present on the shorter test segment]
62 – 5/25/2017, © 2009 Internet2
Over-utilized Link
• Meeting Hotel to MAX
63 – 5/25/2017, © 2009 Internet2
Clean Between Hotel and MAX
Problem is isolated
between MAX and
Harvard
[Diagram: path with the Hotel–MAX portion clean; remaining suspect section lies between MAX and Harvard]
64 – 5/25/2017, © 2009 Internet2
Over-utilized Link
• Results of 1st Round of Debugging
– OWAMP Confirms that the path is ‘clean’
between the Hotel and MAX.
– The path is ‘noisy’ between MAX and Harvard
(could be anywhere – we only know where it
isn’t)
• Action Plan
– Use the other resource available, utilization data, to see if there
is a ‘pinch point’ on one of the links.
– Isolate our search to areas between MAX and Harvard
– Start at MAX
65 – 5/25/2017, © 2009 Internet2
Over-utilized Link
• Starting in the MAX domain, we know of 4 links:
– Hotel to College Park MD
– MAX Core in College Park MD
– College Park MD to McLean VA
– Internet2 Uplink in McLean VA
• Get information on each link:
– 1G from Hotel to College Park MD
– 10G MAX Core and transit to McLean VA
– 2.5G Uplink to Internet2 in McLean VA
66 – 5/25/2017, © 2009 Internet2
Over-utilized Link
• Utilization on Internet2 Uplink from MAX:
67 – 5/25/2017, © 2009 Internet2
Over-utilized Link
• 2nd Round Debugging Results:
– ‘Pinch Point’ found: traffic was coming very close to
2.5G limit
– Not constant – but noticeable during network busy
hours
– ‘Pinch Point’ corrected (i.e. the 2.5G uplink was replaced with a
10G uplink)
– All other segments of the path appeared clean
– Further end-to-end testing after upgrade revealed no
additional problems.
• Epilogue
– Problem was only identified via access to performance
tools
– Necessary to have tools at each segment to truly
isolate the problem
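The utilization check described above can be reduced to simple counter arithmetic. The sketch below assumes hypothetical readings of an SNMP octet counter (such as IF-MIB ifHCInOctets) on the 2.5G uplink; the counter values and polling interval are invented for illustration.

```python
# How utilization data exposes a "pinch point": compute link utilization
# from two successive readings of an SNMP octet counter of the kind a
# perfSONAR SNMP archive would collect. The counter values and link
# speed below are illustrative, not real MAX data.

def utilization(octets_t0, octets_t1, seconds, link_bps):
    bits = (octets_t1 - octets_t0) * 8
    return bits / (seconds * link_bps)

LINK_BPS = 2.5e9                         # the 2.5G uplink from the example
octets_t0 = 1_000_000_000_000
octets_t1 = octets_t0 + 88_000_000_000   # ~88 GB moved in the interval
interval_s = 300                         # 5-minute polling interval

u = utilization(octets_t0, octets_t1, interval_s, LINK_BPS)
print(f"5-minute average utilization: {u:.0%}")
if u > 0.90:
    print("link is running near capacity -> likely pinch point")
```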
68 – 5/25/2017, © 2009 Internet2
Cross Traffic
• A second issue was sporadic spikes of jitter
on certain segments:
69 – 5/25/2017, © 2009 Internet2
Cross Traffic
• Isolated to 2 places on Internet2 Backbone
(Washington and New York):
70 – 5/25/2017, © 2009 Internet2
Cross Traffic
• Observations
– Events were not directly related
– Were sporadic, but seemingly periodic (every 12
hours)
– Theories
• Experiments on the Internet2 Network
• Large Research Flows
• Equipment Related (e.g. CPU/Memory use on
testing machines)
• Action plan was to use Utilization data (again)
71 – 5/25/2017, © 2009 Internet2
Cross Traffic
• Observed between New York and Chicago:
72 – 5/25/2017, © 2009 Internet2
Cross Traffic
• Observed between Washington and Chicago
73 – 5/25/2017, © 2009 Internet2
Cross Traffic
• Digging deeper, it was another host on the
same network:
74 – 5/25/2017, © 2009 Internet2
Cross Traffic
• Debugging Results:
– A host on the same network (a 10G bandwidth
test host) was sending traffic
– This traffic was not isolated properly (shared a
VLAN)
• Solution:
– Disable testing (temporarily)
– Traffic engineer the hosts to prevent this in the future
• Epilogue
– Different types of measurement strategies will help
debug complex problems
75 – 5/25/2017, © 2009 Internet2
Use Cases - Georgetown
76 – 5/25/2017, © 2009 Internet2
Georgetown Overseas Campus
• Georgetown University Campuses
– Washington DC, USA
– Doha, Qatar
• Network access maintained between the two
– Provided via the Qatar Foundation Network
– Peering with Internet2 in New York (MANLAN
Exchange Point)
• Goal of providing high quality audio and video for
distance education between the sites.
77 – 5/25/2017, © 2009 Internet2
Georgetown Overseas Campus
• Problem Statement
– Intermittent network problems affect video
– Local IT staff wanted help in analyzing
performance – before and after a proposed
network upgrade
• Performance Monitoring Solution
– perfSONAR Tools (OWAMP) to analyze the
connection
– Internet2 staff to assist in installation,
configuration and analysis
78 – 5/25/2017, © 2009 Internet2
Georgetown Overseas Campus
• Network Path
– Georgetown (Washington) to MAX
– MAX to Internet2 Backbone
– Internet2 (MANLAN) to QF Network
– QF Network to Georgetown (Doha)
• Proposed Monitoring
– 1st step is New York to Doha
– Installed and configured OWAMP for regular
testing
79 – 5/25/2017, © 2009 Internet2
Georgetown Overseas Campus
• New York to Doha: Observed Jitter
80 – 5/25/2017, © 2009 Internet2
Georgetown Overseas Campus
• Doha to New York: Observed Jitter
81 – 5/25/2017, © 2009 Internet2
Georgetown Overseas Campus
• Interpretation
– Congestion on the (shared) QF Link
– Observed between 12am and 8am (Eastern)
– Congestion caused sporadic loss and jitter – but still
within bounds that allow some forms of video
and audio
• Routing Observations
– All traffic from Doha seems to route back through
MANLAN
– Causes measurement abnormalities (e.g. timekeeping
is affected by the network delay; see the sketch below)
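A small worked example of the timekeeping issue: the standard NTP offset estimate assumes the forward and return delays are equal, so a long or asymmetric path skews it. The timestamps below are invented; only the formulas are standard.

```python
# Why large or asymmetric network delay disturbs timekeeping: the standard
# NTP exchange estimates clock offset assuming the path is symmetric.
# Timestamps below are invented to illustrate the effect.

def ntp_offset_and_delay(t0, t1, t2, t3):
    """t0/t3: client send/receive times; t1/t2: server receive/send times."""
    offset = ((t1 - t0) + (t2 - t3)) / 2
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay

# Symmetric 100 ms each way: the offset estimate is correct (0 here).
print(ntp_offset_and_delay(0.000, 0.100, 0.101, 0.201))

# Asymmetric path (60 ms out, 140 ms back, clocks actually in sync):
# the estimate reports a spurious ~40 ms offset, which in turn skews
# one-way-delay measurements such as OWAMP's.
print(ntp_offset_and_delay(0.000, 0.060, 0.061, 0.201))
```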
82 – 5/25/2017, © 2009 Internet2
Georgetown Overseas Campus
• Observed NTP Performance
83 – 5/25/2017, © 2009 Internet2
Georgetown Overseas Campus
• Interpretation
– The NTP (Network Time Protocol) has a hard time
adjusting to the network delay
– Routing changes may correct this behavior
• Current Status
– 1st Stage of testing complete
– Awaiting link upgrade before starting the 2nd stage
of testing
– May consider expanding monitoring to more
portions of the path
84 – 5/25/2017, © 2009 Internet2
Use Cases - USAtlas
85 – 5/25/2017, © 2009 Internet2
USATLAS Use Case
• 2007 – USATLAS decided as a group to evaluate the 2nd generation
perfSONAR CD (i.e. the NPToolkit) as a testing and monitoring
framework
• Each Tier2 facility and the Tier1 purchased 2 servers
– “Koi Computers” – 1U Chassis
– Dual Core Pentium 2.2GHz Processor
– 2GB RAM
– 160GB Hard Drive
– ~$1200 for both
86 – 5/25/2017, © 2009 Internet2
USATLAS Use Case
• 2009 – All sites still on original hardware, running 3rd generation (3.1
and 3.1.1) of the pS Performance Toolkit
• Testing
– BWCTL
• Test in a “full mesh” to all Tier2s and the Tier1
• 20 Second Throughput Tests, once every 4 Hours
• May adjust the schedule based on how much of the traffic is observed
to be measurement traffic (see the rough estimate below)
– OWAMP
• Test in a “full mesh” to all Tier2s and the Tier1
• Continuous stream of 1 minute OWAMP tests (10 packets per second –
600 total per minute).
• Determine min/max latency, loss, and “jitter” (delay variation)
– PingER
• Not mandatory – but should test to “full mesh” of Tier2s and to the Tier1
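A rough estimate of the measurement overhead implied by this schedule, for deciding whether to adjust it. The site count and the achieved test rate below are assumptions, not USATLAS figures.

```python
# Rough estimate of how much traffic the schedule above generates
# (20 s throughput test every 4 hours, directed full mesh).
# The site count and the achieved test rate are assumptions.

SITES = 6                      # e.g. 5 Tier2s plus the Tier1 (assumed)
PAIRS = SITES * (SITES - 1)    # directed full mesh
TEST_SECONDS = 20
TESTS_PER_DAY_PER_PAIR = 24 / 4
TEST_RATE_BPS = 5e9            # assumed average BWCTL throughput

bytes_per_day = PAIRS * TESTS_PER_DAY_PER_PAIR * TEST_SECONDS * TEST_RATE_BPS / 8
print(f"~{bytes_per_day / 1e12:.1f} TB/day of measurement traffic "
      f"across {PAIRS} directed pairs")
# Comparing this figure with production transfer volumes is how a site
# would decide whether to adjust the schedule, as the slide suggests.
```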
87 – 5/25/2017, © 2009 Internet2
USATLAS Use Case
• Machine Allocation
– 2 Per site
– Placed near other Tier2 equipment (e.g. temperature controlled
and close to the WAN gear)
– Latency Testing Host
• OWAMP Tests
• PingER Tests
• SNMP Monitoring
– Bandwidth Testing Host
• BWCTL Tests
• NDT Diagnostics
• NPAD Diagnostics
88 – 5/25/2017, © 2009 Internet2
USATLAS Use Case
• Procedures
– Takes about 2 weeks to upgrade when there is a new ISO
– 1 – 2 Weeks to establish the initial testing parameters and set up
regular tests
• Set up boxes first so they can be “found” by the perfSONAR tools
• Set up the testing meshes (each site tests to all others).
– Weekly calls (most times with an Internet2/ESnet engineer) to
evaluate the performance they are seeing, request
enhancements, and report bugs regarding the ISO
– Each site will coordinate with others to debug perceived problems
89 – 5/25/2017, © 2009 Internet2
USATLAS Use Case
• Uses
• Regular BWCTL/OWAMP data is viewed daily by site and USAtlas
admins for abnormalities
• Used in conjunction with GridFTP data and other forms of
throughput testing
• Diagnostic tools (NPAD/NDT) are used by Tier2 and Tier3
participants to diagnose problems from end site to USAtlas data
repositories
90 – 5/25/2017, © 2009 Internet2
USATLAS Use Case
• Problems Found
• Throughput problem between Tier1 and Tier2
• Couldn’t exceed 1 Gbps across a 10GE end to end path that included
5 administrative domains
• Used perfSONAR tools to localize problem
• Identified problem device
– An unrelated domain had leaked a full routing table to the router for a
short time causing FIB corruption. The routing problem was fixed, but
router started process switching some flows after that.
• Fixed it
– Rebooting device fixed the symptoms of the problem
– Better BGP filters on that peer will prevent reoccurrence (of 1 cause of
this particular class of soft faults)
• Loss events inbound to a particular Tier2
• Gave a quick reason for a longstanding bandwidth problem
• Corrected quickly once there was proof of loss
91 – 5/25/2017, © 2009 Internet2
USATLAS Use Case
• Future
– Worrying about Tier3s – evaluating new ISO release before
recommending
– Tier3s may not want 2 servers
• Purchase at least one for diagnostics – occasional testing
– Tier3s could not do a full mesh of testing with Tier2s and Tier3s
(too much traffic)
– KOI machines may be replaced with a comparable piece of
hardware
92 – 5/25/2017, © 2009 Internet2
Use Cases - REDDnet
93 – 5/25/2017, © 2009 Internet2
REDDnet Use Case
• REDDnet (Research and Education Data Depot network) is an
NSF-funded infrastructure project designed to provide a large
distributed storage facility for data intensive collaboration
among the nation's researchers and educators in a wide variety
of application areas.
• Its mission is to provide "working storage" to help manage the
logistics of moving and staging large amounts of data in the
wide area network, e.g. among collaborating researchers who
are either trying to move data from one collaborator (person or
institution) to another or who want to share large data sets for
limited periods of time (ranging from a few hours to a few
months) while they work on it.
94 – 5/25/2017, © 2009 Internet2
REDDnet Use Case
• Fall 2008
– REDDnet and partners (University of Delaware, University of
Tennessee Knoxville, and Vanderbilt University) enter the SC08
Bandwidth Challenge
– Utilize resources on the Internet2 backbone to move large science
data sets from REDDnet storage to remote compute facilities
• Use Phoebus to speed up data movement
• Use perfSONAR for monitoring
– Patterns emerged during the ramp up to the BWC:
• Performance in/out of REDDnet locations was severely limited
• No form of performance monitoring available
• No access to diagnostic tools
95 – 5/25/2017, © 2009 Internet2
REDDnet Use Case
• Early/Mid 2009:
– Work with Internet2 engineers to formulate a diagnostic and
monitoring plan
• Tune all REDDnet hosts for WAN performance
• Install tools on all REDDnet depots (BWCTL/OWAMP/NDT client)
• Configure a central host to act as a database of monitoring
information
• Perform regular BWCTL/OWAMP tests between one machine at each
of the 10 Locations
• Collect SNMP statistics where available
– Host a Network Performance Workshop to educate NOC staff
– Interpret the results of the regular monitoring
• Identify the ‘slow’ spots based on observed BWCTL results
• Identify infrastructure faults (e.g. loss, excessive delay) based on
OWAMP results
• Work with Remote hands to identify and correct problems
96 – 5/25/2017, © 2009 Internet2
REDDnet Use Case – Found Problems
• Poorly Tuned Hosts
• Asymmetric Routing
• Switch/Router with Dirty Optics
• Mis-configured Switch
• Hardware Capability
• Malfunctioning Switch
97 – 5/25/2017, © 2009 Internet2
REDDnet Use Case – Hosts
• LAN flows are much different than WAN flows. The TCP settings
of a Linux kernel are normally tuned more for LAN
performance.
– Smaller buffers can’t drive a long flow at a high rate of speed
– Tradeoff: how much Kernel memory space to reserve for TCP vs
other needs
– Interface queue is small to prevent long lines from growing
• Auto-tuning is helping, but not to the level science users need
• Instructions: http://fasterdata.es.net/tuning.html
– Changes to several parameters can yield 4 – 10X improvement
– Takes minutes to implement/test
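A minimal check, assuming a Linux host and an example 10 Gbit/s, 80 ms path, of whether the kernel’s TCP autotuning ceiling is anywhere near the bandwidth-delay product; the recommended settings themselves should come from http://fasterdata.es.net/tuning.html.

```python
# Minimal check of whether a Linux host's TCP autotuning ceiling is large
# enough for a given WAN path, by comparing it to the bandwidth-delay
# product. Path numbers are examples; this only checks the receive-side
# limit, so consult fasterdata.es.net for the full set of parameters.

def bdp_bytes(bottleneck_bps, rtt_s):
    return bottleneck_bps * rtt_s / 8

NEEDED = bdp_bytes(10e9, 0.080)     # example: 10 Gbit/s path, 80 ms RTT

try:
    with open("/proc/sys/net/ipv4/tcp_rmem") as f:
        rmem_max = int(f.read().split()[2])   # fields: min, default, max
    verdict = "OK" if rmem_max >= NEEDED else "too small -> tune the host"
    print(f"tcp_rmem max {rmem_max / 1e6:.0f} MB vs BDP "
          f"{NEEDED / 1e6:.0f} MB: {verdict}")
except FileNotFoundError:
    print("not a Linux host with /proc; check the equivalent settings manually")
```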
98 – 5/25/2017, © 2009 Internet2
REDDnet Use Case – Hosts
• Host Configuration – spot when the settings were tweaked…
• N.B. Example Taken from REDDnet (UMich to TACC, using
BWCTL measurement)
99 – 5/25/2017, © 2009 Internet2
REDDnet Use Case – Routing
• A tale of two paths: Vanderbilt and the University of Michigan are
the start and destination, but how does traffic travel between the two?
– VU → SOX → Internet2 → Ultralight → UM
– VU → SOX → NLR → Ultralight → UM
• Asymmetric Routing:
– The send and receive paths may not be the same
– Normally based on policy on one end or the other (prefer a
certain route, Hot Potato vs Cold Potato)
• Performance Implications:
– Different return vs send path changes the traffic profile
– Queuing on one network may not be present on the other (e.g. is
it worse to queue data packets or acknowledgement packets)
– Loss on one path may not occur on the other; latency can also
differ greatly.
100 – 5/25/2017, © 2009 Internet2
REDDnet Use Case – Dirty Optics
• Dirty optics should throw off some form of error if monitored via
SNMP (e.g. CRC errors)
• Would only be present for one direction of a transfer:
– Assume the ‘in’ direction is dirty.
– If we are sending data, what happens?
• Data packets are flowing out of the router, acknowledgements are
flowing in.
• Dropping ACKs is not so bad for TCP (cumulative ACKs), so the performance
issue may not be noticed.
– If we are receiving data, what happens?
• Data packets are flowing into the router, acknowledgements are flowing
out.
• Dropping data is very bad – it all has to be resent. For a WAN flow this
can cut performance by a lot.
– E.g. our ‘downloaders’ are experiencing problems. Would we fix the
problem?
– What if the direction was reversed…
101 – 5/25/2017, © 2009 Internet2
REDDnet Use Case – Switch Configuration
• Observed at UMich/Ultralight
– Couldn’t exceed 1 Gbps across a 10GE end to end path that
included 5 administrative domains
– Used perfSONAR tools to localize problem
– Identified problem device
• An unrelated domain had leaked a full routing table to the router for
a short time causing FIB corruption. The routing problem was fixed,
but router started process switching some flows after that.
– Fixed it
• Rebooting device fixed the symptoms of the problem
• Better BGP filters on that peer will prevent reoccurrence (of 1 cause
of this particular class of soft faults)
102 – 5/25/2017, © 2009 Internet2
REDDnet Use Case – Hardware
• Older switch used to interconnect several REDDnet servers
– Single 10G Fiber uplink
– Multiple 1G Copper ports
• Experienced problems when multiple servers tried to exceed 1G
offsite
• LAN testing revealed no problems
• Mitigating Factors
– Small loss rates on the uplink
– Switching fabric tried to be fair and limit everyone
– Lack of SNMP meant this was only found after logging on and
viewing switch manually
• Moral: You get what you pay for. A free switch that is more
than 5 years old is not worth the price.
103 – 5/25/2017, © 2009 Internet2
REDDnet Use Case – Switch Fault
• Transfers into and out of Vanderbilt.
• Low levels of periodic loss (bursty loss, every couple of seconds).
• Isolate the problem:
– End to end – Vanderbilt to Univ of Florida. Loss is seen and it impacts
bandwidth.
– Bisect – Vanderbilt to SOX. Loss still seen
– Bisect again – Vanderbilt (ACCRE) to the edge of the campus. Loss is
still seen.
– Bisect again – Host to ACCRE edge. Loss is isolated to the first-hop switch
(see the bisection sketch below)
• Cause:
– Switch has 2 power supplies (PSUs). One was plugged in completely, the
other was slightly ajar.
– The switching between power supplies had an internal effect on performance,
dropping small numbers of packets very frequently.
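The bisection procedure above can be sketched as a binary search over the ordered test points. The path names and the pretend loss test below are hypothetical stand-ins for real OWAMP/BWCTL measurements.

```python
# Sketch of the bisection used above: repeatedly split the path and keep
# testing the half that still shows loss until a single hop remains.
# Test-point names and the loss results are hypothetical stand-ins for
# real OWAMP/BWCTL runs.

PATH = ["host", "ACCRE edge", "campus edge", "SOX", "Univ of Florida"]

# Pretend measurement: loss appears on any span that includes the first hop,
# mimicking the faulty first-hop switch from the slide.
def shows_loss(src, dst):
    return src == "host"

def isolate(path):
    lo, hi = 0, len(path) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # If the near half shows loss, keep searching it; otherwise the
        # problem must be in the far half.
        if shows_loss(path[lo], path[mid]):
            hi = mid
        else:
            lo = mid
    return path[lo], path[hi]

print("loss isolated to segment:", isolate(PATH))
```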
104 – 5/25/2017, © 2009 Internet2
Conclusion
• Performance tools are more than capable of helping to
diagnose and fix network problems
• Success depends on several factors
• Full deployment
• Sound methodology
• Patience!
• The use cases spotlight common themes
• These won’t be the last performance problems we solve
105 – 5/25/2017, © 2009 Internet2
perfSONAR Use Cases
February 11th 2010, APAN 29 – perfSONAR Workshop
Jeff Boote, Senior Network Software Engineer
For more information, visit psps.perfsonar.net
106 – 5/25/2017, © 2009 Internet2