LHCOPN/LHCONE perfSONAR Update
Ian Collier/RAL, presenting for Shawn McKee/UM
LHCONE/LHCOPN Meeting, Taipei, Taiwan, March 13th, 2016

Overview of Talk
• perfSONAR changes and updates
• WLCG, LHCONE and LHCOPN infrastructure overview
• Status and changes in our meshes
• Some new tools: ElasticSearch, MadAlert and topology explorations for our data
• Summary and discussion
LHCONE-Taipei March 13, 2016

Importance of LHCONE perfSONAR
• As we start this presentation, it is important to note how useful it is to have LHCONE perfSONAR instances in place.
• Within the last 2 months we have used instances in the US and Europe to help diagnose network issues.
• We see a gap in coverage for Asia, and it would be very good to get additional instances in place, especially in the regional R&E networks.
• We hope this LHCONE/LHCOPN meeting will be a chance to encourage additional instances in Asia to join the LHCONE monitoring mesh.
• Contact Shawn McKee and Marian Babik if you are interested!

perfSONAR v3.5.1 Toolkit
• perfSONAR v3.5.1 was released on the 4th of March 2016.
• Main themes for this release:
  • A new web interface for creating/managing your regular tests
  • Normalized package names, configuration files and paths
  • Upgrade to Esmond (backward incompatibilities for writing data)
  • Improved support for Debian 7 and 8
• See the release notes: http://www.perfsonar.net/release-notes/version-3-5-1
• In addition, v3.5.1 incorporates feedback and bugfixes from our WLCG/OSG deployments, improving robustness.
• WLCG/OSG deployment status as of today (great progress):
  3.4.1: 6
  3.4.2: 8
  3.5: 2
  3.5.0: 37
  3.5.1: 169
  Unknown: 25 (these nodes are either down or hung)

Review perfSONAR Deployment Options
• Configuration-managed deployments via bundles (see http://docs.perfsonar.net/install_options.html)
  • perfSONAR Tools (just tools)
  • perfSONAR TestPoint (passive, no MA)
  • perfSONAR Core (+MA)
  • perfSONAR Complete (+Web and Toolkit Configuration)
  • perfSONAR Central Management (MaDDash, Auto-config, Centralized config service)
• Low-cost nodes to support large-scale deployment (http://docs.perfsonar.net/low_cost_nodes.html)
  • $100-200 range should enable broad deployment
  • Small form factor enables more locations
  • Some limitations in capabilities due to hardware
• VMs: still not recommended, but possible
  • Target: whole-node VMs, VMs with dedicated physical NICs
  • Main use: “end-to-end” infrastructure testing (not network)
• What about Docker? http://www.perfsonar.net/deploy/installation-and-configuration/

Map of perfSONAR Deployment
• Stats: http://grid-monitoring.cern.ch/perfsonar_report.txt
• Map: https://www.google.com/fusiontables/DataSource?docid=1QT4r17HEufkvnqhJu24nIptZ66XauYEIBWWh5Kpa#map:id=3
• Initial deployment coordinated by the WLCG perfSONAR TF
• Commissioning of the network followed by the WLCG Network and Transfer Metrics WG

Gathering & Storing Metrics
• OSG is providing network metric data for its members and WLCG via the Network Datastore
  • The data is gathered from all WLCG/OSG perfSONAR instances
  • Stored indefinitely on OSG hardware
  • Data available via the Esmond API
  • In production since September 14th 2015
• The primary use cases:
  • Network problem identification and localization
  • Network-related decision support
  • Network baseline: set expectations and identify weak points for upgrading

Review of perfSONAR Pipeline
The diagram on the right provides a high-level view of how WLCG/OSG is
managing our perfSONAR deployments, gathering metrics and making them available for use.
End users can monitor the data via the OSG MaDDash instance, grab the data directly from the OSG datastore, or subscribe to the ActiveMQ bus at CERN.

Configuration for LHCOPN/LHCONE
• We have changed to uni-directional tests for OWAMP to reduce the load.
  • The source host is responsible for initiating and recording test results to each destination.
• We are using iperf3 as the baseline for bandwidth measurements (adds retry information).
• A fix for NDT in the fall ensured that the TCP congestion control algorithm is ‘htcp’ rather than ‘reno’ when NDT and NPAD are not in use, which improves bandwidth results.
• We are sending all the LHCOPN and LHCONE data into ElasticSearch (ongoing).

Existing Test Coverage
• Current perfSONAR measurement coverage for WLCG/OSG:
  • Full latency (one-direction only, 10Hz, OWAMP, IPv4)
  • Full traceroute (bi-directional, hourly, BWCTL/OWAMP, IPv4, IPv6)
  • Full bandwidth (one-direction only, fortnightly, BWCTL-only!, IPv4, IPv6)
• Regional meshes are still disabled; we need to discuss how to evolve them.
  • We can create any sub-mesh of the full latency mesh (for free, but only IPv4 and using the same parameters).
  • We could move from regional to bigger meshes (European, Asia/Pacific, US).
  • We can create new bandwidth meshes, as BWCTL needs fewer resources (but only for BWCTL-only nodes, not on dual nodes).
• We re-enabled project meshes:
  • Belle II: both latency and bandwidth
  • Dual-stack: just bandwidth (both IPv4 and IPv6)
  • LHCONE/LHCOPN: these are tracked separately

perfSONAR Monitoring Pages
• We have 3 versions of our perfSONAR monitoring pages:
  • Prototype at maddash.aglt2.org (intending to phase this out soon)
  • Testing at OSG’s ITB instance
  • Production at OSG’s production instance
• Main monitoring types are MaDDash and OMD/Check_MK.
• Prototype:
  http://maddash.aglt2.org/maddash-webui
  https://maddash.aglt2.org/WLCGperfSONAR/check_mk
• Testing:
  http://perfsonar-itb.grid.iu.edu/maddash-webui/
  https://perfsonar-itb.grid.iu.edu/WLCGperfSONAR/check_mk/
• Production:
  http://psmad.grid.iu.edu/maddash-webui/
  https://psomd.grid.iu.edu/WLCGperfSONAR/check_mk
• Notes:
  • OSG instances rely upon the OSG Datastore: http://psds.grid.iu.edu
  • An X509 cert is needed to view the check_mk/OMD pages (any IGTF cert)

Check_mk for LHCONE/LHCOPN perfSONARs
• https://maddash.aglt2.org/WLCGperfSONAR/check_mk/ (Prototype)
• https://psomd.grid.iu.edu/WLCGperfSONAR/check_mk/ (Production)
• We monitor:
  • “Expected” test coverage
  • NDT/NPAD running?
  • Memory on hosts (<4GB)
  • New “version” test
• Access requires an x509 credential from an IGTF CA
• Gives us a good view into where problems still exist

Monitoring Metrics
• Use MaDDash to view metric summaries
  • Provides a quick view of how networks are working
• OSG hosts the production instance: http://psmad.grid.iu.edu/maddash-webui/
• Metrics are displayed via a source-destination matrix
• Multiple dashboards (meshes) can be selected
• Custom menus link to relevant resources
• The new release (2.0) will incorporate MadAlert: http://maddash.aglt2.org/madalert.html

Evolution of LHCOPN/LHCONE Monitoring
• As usual, we will show how the monitoring in MaDDash has changed since the last meeting.
• We have two known problems with LHCONE instances from GEANT and Internet2:
  • The GEANT instance in Amsterdam was recently upgraded to perfSONAR v3.5.1, BUT there is a problem writing to the updated Esmond.
  • The Internet2 instances are “multi-purpose” and have an MA which uses a different FQDN/IP than the LHCONE measurement interface. The current mesh-config isn’t set up to handle this configuration.
  • Additionally, there may be some problems with these v3.4.1 instances.

LHCONE MaDDash – 27 Oct 2015
• Some issues getting data from the Internet2/GEANT instances that we need to look into.

LHCONE MaDDash – 11 Mar 2016
• Things are looking a bit worse.
• We have known issues with the AMS_GEANT and Internet2 instances that are being worked on.
• Are there real issues inbound to IN2P3 as well as problems outbound? This should be investigated.

LHCOPN MaDDash – 27 Oct 2015
• Some firewall problems for the OSG collector from FNAL.
• Setup being examined at INFN and PIC.
• Patches to address IPv6/IPv4 and http/https were put in yesterday and things are broken; this should be fixed later today.

LHCOPN MaDDash – 11 Mar 2016
• RAL and TRIUMF are showing signs of continuing network problems.
• The latency mesh has improved.
• The BW mesh still shows many issues.
• KISTI still has BW problems.

Existing Tools
• We have a number of tools available to help debug and understand network problems.
• There are very good presentations on these tools in the training materials provided by perfSONAR: http://www.perfsonar.net/about/training-materials/
• While I don’t have time to cover all the details (see http://www.perfsonar.net/about/training-materials/201507-ps-training/ and especially the Measurement Tools, Use Cases and Debugging presentations from Jason Zurawski), I do want to note that command-line tools exist that let you create on-demand 3rd-party tests (between two remote instances) for bandwidth, latency and traceroute.
• Follow the debugging strategy as a guide to finding and fixing LHCONE/LHCOPN network issues using perfSONAR capabilities.
• As for new tools….
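As a rough illustration of the on-demand 3rd-party tests mentioned on the Existing Tools slide, the sketch below only assembles command lines; it does not run anything. It assumes the BWCTL-family convention of `-s` for the sending host and `-c` for the receiving host, and `-t` for the test duration in seconds with `bwctl`; check these against the man pages of your installed version. The function name and the hostnames are hypothetical and do not come from the talk.

```python
import shlex

# Tools assumed here to accept "-s <sender> -c <receiver>" (an assumption,
# not taken from the talk; verify against your installed man pages).
SUPPORTED_TOOLS = ("bwctl", "bwtraceroute")


def third_party_command(tool, sender, receiver, duration=None):
    """Build (but do not run) an argv list for a test between two remote hosts."""
    if tool not in SUPPORTED_TOOLS:
        raise ValueError(f"unsupported tool: {tool}")
    argv = [tool, "-s", sender, "-c", receiver]
    # A duration only makes sense for the bandwidth test in this sketch.
    if tool == "bwctl" and duration is not None:
        argv += ["-t", str(duration)]
    return argv


# Hypothetical hosts; replace with real perfSONAR instances.
bw_cmd = third_party_command("bwctl", "ps-a.example.org", "ps-b.example.org", duration=20)
trace_cmd = third_party_command("bwtraceroute", "ps-a.example.org", "ps-b.example.org")
print(shlex.join(bw_cmd))
print(shlex.join(trace_cmd))
```

Keeping the command as a list rather than a single string makes it safe to hand to `subprocess.run` later without shell quoting problems.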

ATLAS Network Metrics Pipeline
• Ilija Vukotic, Kaushik De, Rob Gardner and Jorge Batista are working with the Network and Transfer Metrics WG to make perfSONAR metrics available to PANDA.
  • See Ilija’s presentation at http://tinyurl.com/gt92zwb
• Pipeline: OSG Network Datastore -> CERN ActiveMQ -> Flume -> ES -> PANDA
• A prototype is working, and analytics are being performed in ElasticSearch to validate the data (see following slide).
• Working on a network source-destination cost matrix that PANDA can use to evaluate options.
  • Interface details are being discussed with the PANDA team.
• Could also be used to analyze LHCONE/LHCOPN data!

perfSONAR Data into ElasticSearch
[Plots: average source loss % and average destination loss %]
• See http://tinyurl.com/z4dnfs8 for example plots using WLCG data

MadAlert: A project to analyze meshes
• Gabriele Carcassi has been working with me on creating a new utility to analyze meshes: MadAlert
  • See details at http://madalert.aglt2.org/madalert/index.html
  • You can see meshes and reports from the page
  • Reports find both infrastructure and network problems
• We are now working with Andy Lake/ESnet to incorporate this into the next major release of MaDDash (v2.0).
• Now testing a “diff” to allow us to compare meshes, e.g., IPv4 vs IPv6, testing vs production, mesh(t1) vs mesh(t2): http://madalert.aglt2.org/madalert/testDiff.html
  • Could be really helpful for understanding new software versions or changes over time.
  • Time-based comparison will require some modifications to MaDDash to allow specifying time-based meshes.

Understanding Network Topology
• Can we create tools to manipulate, visualize, compare and analyze network topologies from the OSG network datastore contents?
• Can we build upon these tools to create a set of next-generation network diagnostic tools to make debugging network problems easier, quicker and more accurate?
• Even without requiring the ability to perform complicated data analysis and correlation, basic tools developed in the area of network topology-based metric visualization would be very helpful in letting users and network engineers better understand what is happening in our networks.
• This area is under active investigation in various projects. Lots of work to do here.

Exploring Path Analysis
• We can correlate paths with packet-loss/latency information (PuNDIT)
• We can simplify the graph by aggregating nodes that belong to the same NREN (visual debugging)
[Figure: example path graph with nodes Aachen, GEANT, DFN, RAL, JANET, QMUL and ITEP, labelled with latency, packet-loss, throughput]

WLCG Support Unit
• Reminder: we have a GGUS support unit (WLCG Network Throughput; https://wiki.egi.eu/wiki/GGUS:WLCG_Network_Throughput) used to report incidents (mailing list: wlcg-network-throughput at cern.ch)
  • Experiments can report potential network performance incidents.
  • WLCG perfSONAR support investigates and confirms whether this is a network-related issue.
  • Once confirmed, it will notify relevant sites and will try to assist in narrowing down the problem to particular link(s).
  • Tracking of ongoing incidents will be via the WG page.
• Sites observing a network performance problem should follow their standard procedure, i.e. report to their network team and, if necessary, escalate to their network provider.
  • If confirmed to be WAN related, the WLCG perfSONAR support unit can assist in further debugging.
• For non-technical (policy) issues, sites should escalate to the WLCG operations coordination.
• https://twiki.cern.ch/twiki/bin/view/LCG/NetworkTransferMetrics#Network_Performance_Incidents
• LHCOPN/LHCONE experts are very important in this coordinated activity.

Next Steps
• We are working on getting ALL WLCG/OSG perfSONAR instances fully operational and properly configured.
  • We have hints that some perfSONAR services stop or hang under some circumstances.
    Working with developers to isolate/fix.
  • Some hosts are underpowered (<4GB memory on latency nodes) or broken.
• As we fix known issues and get to reliable operation, we can free up time to pursue possible issues in the network itself, rather than in the framework that gets us network metrics.
• We need to plan for a campaign to clear up the remaining LHCONE/LHCOPN problems.
  • Currently working on the LHCONE issues we noted previously.
• Need more instances in Asia in the regional R&E networks!!

Discussion/Questions/Comments?

References
• Network Documentation
  https://www.opensciencegrid.org/bin/view/Documentation/NetworkingInOSG
• Deployment documentation for OSG and WLCG hosted in OSG
  https://twiki.opensciencegrid.org/bin/view/Documentation/DeployperfSONAR
• New MA guide
  http://software.es.net/esmond/perfsonar_client_rest.html
• Modular Dashboard and OMD Prototypes
  http://maddash.aglt2.org/maddash-webui
  https://maddash.aglt2.org/WLCGperfSONAR/check_mk
• OSG Production instances for OMD, MaDDash and Datastore
  http://psmad.grid.iu.edu/maddash-webui/
  https://psomd.grid.iu.edu/WLCGperfSONAR/check_mk/
  http://psds.grid.iu.edu/esmond/perfsonar/archive/?format=json
• Mesh-config in OSG
  https://oim.grid.iu.edu/oim/meshconfig
• New mesh config info (send feedback to Soichi)
  http://soichi7.ppa.iu.edu/pdoc/mca.html
• Use-cases document for experiments and middleware
  https://docs.google.com/document/d/1ceiNlTUJCwSuOuvbEHZnZp0XkWkwdkPQTQic0VbH1mc/edit
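
The datastore URL listed in the references can be filtered with query parameters, as described in the MA guide. The sketch below only builds such a URL and makes no network request. The parameter names (`source`, `destination`, `event-type`, `time-range`) and the `packet-loss-rate` event type are given as understood from the Esmond REST client documentation and should be checked there; the function name and hostnames are hypothetical.

```python
from urllib.parse import urlencode

# Archive endpoint taken from the references list above.
ARCHIVE_URL = "http://psds.grid.iu.edu/esmond/perfsonar/archive/"


def metadata_query_url(source, destination, event_type, time_range_s):
    """Build a metadata query URL for one source/destination pair.

    The filter names are assumptions based on the Esmond perfSONAR
    REST client documentation; verify them before relying on this.
    """
    params = {
        "format": "json",
        "source": source,
        "destination": destination,
        "event-type": event_type,
        "time-range": time_range_s,  # look-back window in seconds
    }
    return ARCHIVE_URL + "?" + urlencode(params)


# Hypothetical hosts; one day of packet-loss measurements.
url = metadata_query_url(
    "ps-a.example.org", "ps-b.example.org", "packet-loss-rate", 86400
)
print(url)
```

The returned URL can be passed to any HTTP client; the JSON response is expected to list matching measurement metadata, from which the individual time series are then followed.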