LHCOPN/LHCONE perfSONAR Update
Ian Collier/RAL Presenting
for Shawn McKee/UM
LHCONE/LHCOPN Meeting Taipei, Taiwan
March 13th, 2016
Overview of Talk
• perfSONAR Changes and updates
• WLCG, LHCONE and LHCOPN infrastructure overview
   - Status and changes in our meshes
• Some new tools
   - ElasticSearch, MadAlert and topology explorations for our data
• Summary and Discussion
LHCONE-Taipei
March 13, 2016
2
Importance of LHCONE perfSONAR
• As we start this presentation, it is important to note the usefulness of having LHCONE perfSONAR instances in place.
   - Just within the last 2 months we have used instances in the US and Europe to help diagnose network issues.
   - We see a gap in coverage for Asia, and it would be very good to get additional instances in place, especially in the regional R&E networks.
   - We are hoping this LHCONE/LHCOPN meeting will be a chance to encourage additional instances in Asia to join the LHCONE monitoring mesh.
   - Contact Shawn McKee and Marian Babik if you are interested!
perfSONAR v3.5.1 Toolkit
• perfSONAR v3.5.1 was released on the 4th of March 2016
• Main themes for this release:
   - A new web interface for creating/managing your regular tests
   - Normalized package names, configuration files and paths
   - Upgrade to Esmond (backward incompatibilities for writing data)
   - Improved support for Debian 7 and 8
   - See release notes: http://www.perfsonar.net/release-notes/version-3-5-1
• In addition, v3.5.1 incorporates feedback and bugfixes from our WLCG/OSG deployments, improving robustness.
• WLCG/OSG deployment status as of today (great progress):
   - 3.4.1: 6
   - 3.4.2: 8
   - 3.5: 2
   - 3.5.0: 37
   - 3.5.1: 169
   - Unknown: 25 (these nodes are either down or hung)
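To put the deployment status in proportion, here is a small sketch that tallies the counts listed above and computes the share of hosts on each version. The counts are copied from the slide; the helper name `share` is only illustrative.

```python
# Host counts per reported toolkit version, copied from the status list above.
deployment = {
    "3.4.1": 6,
    "3.4.2": 8,
    "3.5": 2,
    "3.5.0": 37,
    "3.5.1": 169,
    "Unknown": 25,
}


def share(counts, key):
    """Return the percentage of all listed hosts that fall under `key`."""
    total = sum(counts.values())
    return 100.0 * counts[key] / total


total_hosts = sum(deployment.values())
print(total_hosts)                               # 247
print(round(share(deployment, "3.5.1"), 1))      # 68.4
print(round(share(deployment, "Unknown"), 1))    # 10.1
```

So roughly two thirds of the listed hosts were already on v3.5.1 at the time of the talk, and about one in ten was not reporting a version.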
Review perfSONAR Deployment Options
• Configuration-managed deployments via bundles (see http://docs.perfsonar.net/install_options.html)
   - perfSONAR Tools (just tools)
   - perfSONAR TestPoint (passive, no MA)
   - perfSONAR Core (+MA)
   - perfSONAR Complete (+Web and Toolkit Configuration)
   - perfSONAR Central Management (MaDDash, Auto-config, Centralized config service)
• Low-cost nodes to support large-scale deployment (http://docs.perfsonar.net/low_cost_nodes.html)
   - $100-200 range should enable broad deployment
   - Small form factor enables more locations
   - Some limitations in capabilities due to hardware
• VMs: still not recommended but possible
   - Target: whole-node VMs, VMs with dedicated physical NICs
   - Main use: “end-to-end” infrastructure testing (not network)
• What about Docker?
   - http://www.perfsonar.net/deploy/installation-and-configuration/
Map of perfSONAR Deployment
http://grid-monitoring.cern.ch/perfsonar_report.txt for stats
https://www.google.com/fusiontables/DataSource?docid=1QT4r17HEufkvnqhJu24nIptZ66XauYEIBWWh5Kpa#map:id=3
• Initial deployment coordinated by WLCG perfSONAR TF
• Commissioning of the network followed by WLCG Network and
Transfer Metrics WG
Gathering & Storing Metrics
• OSG is providing network metric data for its members and WLCG via the Network Datastore
   - The data is gathered from all WLCG/OSG perfSONAR instances
   - Stored indefinitely on OSG hardware
   - Data available via the Esmond API
   - In production since September 14th, 2015
• The primary use cases:
   - Network problem identification and localization
   - Network-related decision support
   - Network baseline: set expectations and identify weak points for upgrading
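As a rough illustration of what "available via the Esmond API" can look like in practice, the sketch below only composes a metadata query URL; it does not contact the service. The base URL is the archive address given in this talk's references. The filter names (`source`, `destination`, `event-type`, `time-range`) are taken from my reading of the Esmond perfSONAR REST client guide and should be checked against it; the host names are hypothetical.

```python
from urllib.parse import urlencode

# Archive endpoint as listed in the references of this talk.
ARCHIVE_URL = "http://psds.grid.iu.edu/esmond/perfsonar/archive/"


def build_metadata_query(source=None, destination=None,
                         event_type=None, time_range_s=None):
    """Compose a metadata query URL; filters left as None are omitted."""
    params = {"format": "json"}
    if source is not None:
        params["source"] = source
    if destination is not None:
        params["destination"] = destination
    if event_type is not None:
        params["event-type"] = event_type
    if time_range_s is not None:
        params["time-range"] = time_range_s  # look-back window in seconds
    return ARCHIVE_URL + "?" + urlencode(params)


# Hypothetical hosts: packet-loss metadata for one pair over the last hour.
url = build_metadata_query(
    source="ps-latency.site-a.example.org",
    destination="ps-latency.site-b.example.org",
    event_type="packet-loss-rate",
    time_range_s=3600,
)
print(url)
```

The returned URL can then be fetched with any HTTP client; the response is expected to be JSON metadata from which the individual time series are followed.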
Review of perfSONAR Pipeline
The diagram on the right provides a high-level view of how WLCG/OSG is managing our perfSONAR deployments, gathering metrics and making them available for use.
End users can monitor the data via the OSG MaDDash instance, grab the data directly from the OSG datastore or subscribe to the ActiveMQ bus at CERN.
Configuration for LHCOPN/LHCONE
• We have changed to uni-directional tests for OWAMP to reduce the load
   - The source host is responsible for initiating and recording test results to each destination
• We are using iperf3 as the baseline for bandwidth measurements (adds retry information)
   - A fix for NDT last fall ensured that TCP congestion control uses ‘htcp’ rather than ‘reno’ when NDT and NPAD are not in use, which improves bandwidth results.
• We are sending all the LHCOPN and LHCONE data into ElasticSearch (ongoing)
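To see why uni-directional testing reduces load, here is a minimal sketch that counts tests in a full mesh under both schemes. It assumes, as I understand the bi-directional setup, that every host initiates both the forward and the reverse test with each peer, so each direction ends up measured twice. Host names are hypothetical.

```python
from itertools import permutations


def unidirectional_tests(hosts):
    """One test per ordered (source, destination) pair, run by the source."""
    return list(permutations(hosts, 2))


def bidirectional_tests(hosts):
    """Each host initiates forward and reverse tests with every peer.

    Entries are (initiator, source, destination), so every direction
    appears twice: once initiated from each end.
    """
    tests = []
    for initiator in hosts:
        for peer in hosts:
            if peer != initiator:
                tests.append((initiator, initiator, peer))  # forward
                tests.append((initiator, peer, initiator))  # reverse
    return tests


hosts = ["ps-a.example.org", "ps-b.example.org", "ps-c.example.org"]
print(len(unidirectional_tests(hosts)))  # 6
print(len(bidirectional_tests(hosts)))   # 12
```

Under this assumption the uni-directional scheme halves the number of tests, N(N-1) instead of 2N(N-1), while still covering every direction once.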
Existing Test Coverage
• Current perfSONAR measurement coverage for WLCG/OSG:
   - Full latency (one-direction only, 10Hz, OWAMP, IPv4)
   - Full traceroute (bi-directional, hourly, BWCTL/OWAMP, IPv4, IPv6)
   - Full bandwidth (one-direction only, fortnightly, BWCTL-only!, IPv4, IPv6)
• Regional meshes still disabled, need to discuss how to evolve
   - We can create any sub-mesh of the full latency mesh (for free, but only IPv4 and using the same parameters)
   - We could move from regional to bigger meshes (European, Asia/Pacific, US)
   - We can create new bandwidth meshes as bwctl needs fewer resources (but only for BWCTL-only nodes, not on dual-nodes)
• We re-enabled project meshes
   - Belle II: both latency and bandwidth
   - Dual-stack: just bandwidth (both IPv4 and IPv6)
   - LHCONE/LHCOPN: these are separately tracked
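One way to read "any sub-mesh of the full latency mesh for free" is that the full mesh already measures every pair, so a sub-mesh is just a filtered view of existing results and needs no new tests. The sketch below illustrates that reading only; the data layout, site names and loss values are hypothetical.

```python
def sub_mesh(full_mesh, hosts):
    """Keep only cells whose source and destination are both in `hosts`."""
    wanted = set(hosts)
    return {
        (src, dst): value
        for (src, dst), value in full_mesh.items()
        if src in wanted and dst in wanted
    }


# Hypothetical packet-loss fractions keyed by (source, destination).
full_mesh = {
    ("site-a", "site-b"): 0.0,
    ("site-b", "site-a"): 0.1,
    ("site-a", "site-c"): 0.0,
    ("site-c", "site-a"): 0.0,
    ("site-b", "site-c"): 0.2,
    ("site-c", "site-b"): 0.0,
}

regional = sub_mesh(full_mesh, ["site-a", "site-b"])
print(sorted(regional))  # [('site-a', 'site-b'), ('site-b', 'site-a')]
```

Because the view reuses existing measurements, it inherits their parameters, which matches the slide's caveat about IPv4 only and identical test settings.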
perfSONAR Monitoring Pages
• We have 3 versions of our perfSONAR monitoring pages
   - Prototype at maddash.aglt2.org (intending to phase this out soon)
   - Testing at OSG’s ITB instance
   - Production at OSG’s production instance
• Main monitoring types are MaDDash and OMD/Check_MK
   - Prototype: http://maddash.aglt2.org/maddash-webui and https://maddash.aglt2.org/WLCGperfSONAR/check_mk
   - Testing: http://perfsonar-itb.grid.iu.edu/maddash-webui/ and https://perfsonar-itb.grid.iu.edu/WLCGperfSONAR/check_mk/
   - Production: http://psmad.grid.iu.edu/maddash-webui/ and https://psomd.grid.iu.edu/WLCGperfSONAR/check_mk
• Notes:
   - OSG instances rely upon the OSG Datastore: http://psds.grid.iu.edu
   - X.509 cert needed to view check_mk/OMD pages (any IGTF cert)
Check_mk for LHCONE/LHCOPN perfSONARs
https://maddash.aglt2.org/WLCGperfSONAR/check_mk/ (Prototype)
https://psomd.grid.iu.edu/WLCGperfSONAR/check_mk/ (Production)
We monitor:
• “Expected” test coverage
• NDT/NPAD running?
• Memory on hosts (<4GB)
• New “version” test
Access requires an X.509 credential from an IGTF CA.
This gives us a good view into where problems still exist.
Monitoring Metrics
• Use MaDDash to view metric summaries
   - Provides a quick view of how networks are working
• OSG hosts the production instance: http://psmad.grid.iu.edu/maddash-webui/
• Metrics are displayed via a source-destination matrix
• Multiple dashboards (meshes) can be selected
• Custom menus link to relevant resources
• New release (2.0) will incorporate MadAlert: http://maddash.aglt2.org/madalert.html
Evolution of LHCOPN/LHCONE Monitoring
• As usual, we will show how the monitoring in MaDDash has changed since the last meeting
• We have two known problems with LHCONE instances from GEANT and Internet2
   - The GEANT instance in Amsterdam was recently upgraded to perfSONAR v3.5.1, BUT there is a problem writing to the updated Esmond
   - The Internet2 instances are “multi-purpose” and have an MA which uses a different FQDN/IP than the LHCONE measurement interface. The current mesh-config isn’t set up to handle this configuration.
   - Additionally, there may be some problems with these v3.4.1 instances
LHCONE MaDDash – 27 Oct 2015
Some issues getting data from Internet2/GEANT instances we need to look into
LHCONE MaDDash – 11 Mar 2016
Things are looking a bit worse. We have known issues with the AMS_GEANT and
Internet2 instances that are being worked on. Real issues into IN2P3 as well as
problems outbound? Should be investigated.
LHCOPN MaDDash – 27 Oct 2015
Some firewall problems for the OSG collector from FNAL. Setup being examined at
INFN and PIC. Patches to address IPv6/IPv4 and http/https were put in yesterday and things
are currently broken; this should be fixed later today.
LHCOPN MaDDash – 11 Mar 2016
RAL and TRIUMF showing signs of continuing network problems. Latency mesh
improved. BW mesh still shows many issues. KISTI still has BW problems.
Existing Tools
• We have a number of tools available to help debug and understand network problems.
• There are very good presentations on these tools in the training materials provided by perfSONAR: http://www.perfsonar.net/about/training-materials/
• While I don’t have time to cover all the details (see http://www.perfsonar.net/about/training-materials/201507-ps-training/ and especially the Measurement Tools, Use Cases and Debugging presentations from Jason Zurawski), I do want to note that command-line tools exist to allow you to create on-demand 3rd party tests (between two remote instances) for bandwidth, latency and traceroute.
   - Follow the debugging strategy as a guide to finding and fixing LHCONE/LHCOPN network issues using perfSONAR capabilities
• As for new tools…
ATLAS Network Metrics Pipeline
• Ilija Vukotic, Kaushik De, Rob Gardner and Jorge Batista are working with the Network and Transfer Metrics WG to make perfSONAR metrics available to PANDA
   - See Ilija’s presentation at http://tinyurl.com/gt92zwb
• Pipeline: OSG Network Datastore -> CERN ActiveMQ -> Flume -> ES -> PANDA
• Prototype working and analytics being performed in ElasticSearch to validate data (see following slide)
• Working on a network source-destination cost matrix that PANDA can use to evaluate options
   - Interface details being discussed with the PANDA team
• Could also be used to analyze LHCONE/LHCOPN data!
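The slide does not define the cost matrix, so the following is only a sketch of the general idea: reduce many per-pair measurements to one number per (source, destination) cell. Averaging latency is an assumption made for illustration, not the interface under discussion with the PANDA team; site names and values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def cost_matrix(records):
    """Average the metric value for every (source, destination) pair.

    `records` is an iterable of (source, destination, value) tuples.
    """
    samples = defaultdict(list)
    for src, dst, value in records:
        samples[(src, dst)].append(value)
    return {pair: mean(values) for pair, values in samples.items()}


# Hypothetical one-way latency samples in milliseconds.
records = [
    ("site-a", "site-b", 10.0),
    ("site-a", "site-b", 12.0),
    ("site-b", "site-a", 30.0),
]

matrix = cost_matrix(records)
print(matrix[("site-a", "site-b")])  # 11.0
print(matrix[("site-b", "site-a")])  # 30.0
```

A real cost function would likely combine several metrics (loss, latency, throughput) and handle missing cells, which this sketch leaves out.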
perfSONAR Data into ElasticSearch
Example plots using WLCG data (average source loss % and average destination loss %): http://tinyurl.com/z4dnfs8
MadAlert: A project to analyze meshes
• Gabriele Carcassi has been working with me on creating a new utility to analyze meshes: MadAlert
• See details at http://madalert.aglt2.org/madalert/index.html
   - You can see meshes and reports from the page
   - Reports find both infrastructure and network problems
• We are now working with Andy Lake/ESnet to incorporate this into the next major release of MaDDash (v2.0)
• Now testing a “diff” to allow us to compare meshes; e.g., IPv4 vs IPv6, testing vs production, mesh(t1) vs mesh(t2)
   - http://madalert.aglt2.org/madalert/testDiff.html
   - Could be really helpful for understanding new software versions or changes over time. Time-based comparison will require some modifications to MaDDash to allow specifying time-based meshes.
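The mesh "diff" idea can be sketched as a cell-by-cell comparison of two source-destination matrices. This is not MadAlert's implementation; the dictionary layout and the status labels below are assumptions chosen for illustration.

```python
def diff_meshes(mesh_a, mesh_b):
    """Return {cell: (status_a, status_b)} for every cell that differs.

    A cell present in only one mesh is reported with None on the other side.
    """
    cells = sorted(set(mesh_a) | set(mesh_b))
    return {
        cell: (mesh_a.get(cell), mesh_b.get(cell))
        for cell in cells
        if mesh_a.get(cell) != mesh_b.get(cell)
    }


# Hypothetical statuses for the same host pairs over IPv4 and IPv6.
ipv4 = {
    ("site-a", "site-b"): "OK",
    ("site-b", "site-a"): "OK",
    ("site-a", "site-c"): "CRITICAL",
}
ipv6 = {
    ("site-a", "site-b"): "OK",
    ("site-b", "site-a"): "WARNING",
}

changes = diff_meshes(ipv4, ipv6)
for cell, (v4, v6) in changes.items():
    print(cell, v4, v6)
```

Only the cells that differ are returned, which is what makes comparisons such as IPv4 vs IPv6 or testing vs production quick to scan.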
Understanding Network Topology
• Can we create tools to manipulate, visualize, compare and analyze network topologies from the OSG network datastore contents?
• Can we build upon these tools to create a set of next-generation network diagnostic tools to make debugging network problems easier, quicker and more accurate?
• Even without requiring the ability to perform complicated data analysis and correlation, basic tools developed in the area of network topology-based metric visualization would be very helpful in letting users and network engineers better understand what is happening in our networks.
• This area is under active investigation in various projects. Lots of work to do here.
Exploring Path Analysis
• We can correlate paths with packet-loss/latency information (PuNDIT)
• We can simplify the graph by aggregating nodes that belong to the same NREN (visual debugging)
[Figure: path graph; node labels Aachen, GEANT, DFN, RAL, JANET, QMUL, ITEP; metric labels latency, packet-loss, throughput]
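Aggregating nodes that belong to the same NREN can be sketched as collapsing consecutive traceroute hops that map to the same network. The router names, the hop-to-network mapping and the example path below are hypothetical and are not taken from the figure.

```python
from itertools import groupby


def collapse_by_network(hops, network_of):
    """Merge consecutive hops that belong to the same network into one node.

    Hops without an entry in `network_of` are kept under their own name.
    """
    labels = [network_of.get(hop, hop) for hop in hops]
    return [label for label, _ in groupby(labels)]


# Hypothetical mapping from router name to the network operating it.
network_of = {
    "dfn-r1": "DFN",
    "dfn-r2": "DFN",
    "geant-r1": "GEANT",
    "geant-r2": "GEANT",
    "janet-r1": "JANET",
}

path = ["Aachen", "dfn-r1", "dfn-r2", "geant-r1", "geant-r2", "janet-r1", "QMUL"]
print(collapse_by_network(path, network_of))
# ['Aachen', 'DFN', 'GEANT', 'JANET', 'QMUL']
```

Only adjacent hops are merged, so a path that leaves a network and later re-enters it still shows both visits.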
WLCG Support Unit
• Reminder: We have a GGUS support unit (WLCG Network Throughput; https://wiki.egi.eu/wiki/GGUS:WLCG_Network_Throughput) used to report incidents (mailing list: wlcg-network-throughput at cern.ch)
• Experiments can report potential network performance incidents.
   - WLCG perfSONAR support investigates and confirms whether this is a network-related issue.
   - Once confirmed, it will notify relevant sites and will try to assist in narrowing down the problem to particular link(s). Tracking of ongoing incidents will be via the WG page.
• Sites observing a network performance problem should follow their standard procedure, i.e. report to their network team and, if necessary, escalate to their network provider.
   - If confirmed to be WAN related, the WLCG perfSONAR support unit can assist in further debugging. For non-technical (policy) issues, sites should escalate to WLCG operations coordination.
   - https://twiki.cern.ch/twiki/bin/view/LCG/NetworkTransferMetrics#Network_Performance_Incidents
• LHCOPN/LHCONE experts are very important in this coordinated activity.
Next Steps
• We are working on getting ALL WLCG/OSG perfSONAR instances fully operational and properly configured
   - We have hints that some perfSONAR services stop or hang under some circumstances. Working with developers to isolate/fix.
   - Some hosts are underpowered (<4GB of memory on latency hosts) or broken
• As we fix known issues and get to reliable operation, we can free up time to pursue possible issues in the network itself, rather than the framework that gets us network metrics.
• We need to plan for a campaign to clear up remaining LHCONE/LHCOPN problems.
   - Currently working on the LHCONE issues we noted previously.
   - Need more instances in Asia in the regional R&E networks!!
Discussion/Questions/Comments?
References
• Network Documentation: https://www.opensciencegrid.org/bin/view/Documentation/NetworkingInOSG
• Deployment documentation for OSG and WLCG hosted in OSG: https://twiki.opensciencegrid.org/bin/view/Documentation/DeployperfSONAR
• New MA guide: http://software.es.net/esmond/perfsonar_client_rest.html
• Modular Dashboard and OMD prototypes:
   - http://maddash.aglt2.org/maddash-webui
   - https://maddash.aglt2.org/WLCGperfSONAR/check_mk
• OSG production instances for OMD, MaDDash and Datastore:
   - http://psmad.grid.iu.edu/maddash-webui/
   - https://psomd.grid.iu.edu/WLCGperfSONAR/check_mk/
   - http://psds.grid.iu.edu/esmond/perfsonar/archive/?format=json
• Mesh-config in OSG: https://oim.grid.iu.edu/oim/meshconfig
   - New mesh config info: http://soichi7.ppa.iu.edu/pdoc/mca.html (send feedback to Soichi)
• Use-cases document for experiments and middleware: https://docs.google.com/document/d/1ceiNlTUJCwSuOuvbEHZnZp0XkWkwdkPQTQic0VbH1mc/edit