Large Synoptic Survey Telescope (LSST)
LSST Network Operations and Management Plan
Julio Ibarra, Chip Cox, Sandra Jaque, Ron Lambert, Jeff Kantor,
Jeronimo Aguiar, James Grace, Mike Freemon, Tim Boerner, Dave
Wheeler, Albert Astudillo
Document-11918
Latest Revision Date: 5/4/2017 5:04:00 PM
Change Record
Version | Date | Description | Owner name
1.0 | 8/3/11 | Initial draft | Julio Ibarra
1.1 | 6/12/13 | Revised initial draft with Long-Haul Network review recommendations | Jeff Kantor
1.2 | 7/12/13 | Added comments and prepared document for review by working group | Julio Ibarra
1.3 | 7/24/13 | Replaced Figure 1. Modified version 1.2 using notes from the last call | Julio Ibarra
1.4 | 7/26/13 | Sections 4.4.1, 4.4.2, and 4.4.3 added | Jeronimo Aguiar
1.5 | 8/9/13 | Contributions on sections 2.1, 2.2, 2.4 and 4.4 | Sandra Jaque
1.5 | 8/9/13 | Updates to sections 2, 3, 4, 5 | Chip Cox
1.5 | 8/22/13 | Updates to sections 2, 5 and 8 | Julio Ibarra
1.6 | 8/23/13 | Incorporated contributions from Sandra and Chip. Section 2.4 | Julio Ibarra
1.7 | 9/6/13 | Section 5, and Appendix on parameters for alert notification | Jeronimo Aguiar
1.8 | 9/8/13 | Added comments with assignments from the call on 9/6 | Julio Ibarra
1.9 | 9/20/13 | Revisions to sections 3, 4.2, 4.3.1, 4.4, 5 | Julio Ibarra
1.10 | 10/16/13 | Appendix 12.2 Measurement Instrumentation | Jeronimo Aguiar
1.11 | 11/8/13 | Provided comments to various sections | Jeronimo Aguiar
1.12 | 11/15/13 | Update to section 5.3 | Sandra Jaque
1.13 | 11/21/13 | Incorporated Jeronimo’s comments; section 4.4.6 | Julio Ibarra
1.14 | 11/22/13 | Updates to section 7.3; merged part of section 9 to section 1; replaced Figure 1. | Julio Ibarra
1.14 | 11/26/13 | Updates to section 7 and several other sections | Jeronimo Aguiar
Table of Contents
Change Record
1 Summary
2 Network Services
   2.1 Ethernet Transport Service
   2.2 IP routed services
   2.3 Bandwidth Management Services
   2.4 Diagnostic Services and Tools
      2.4.1 End-point Reachability
      2.4.2 Circuit Latency
      2.4.3 Circuit Throughput
3 Responsibilities
   3.1 Relationships between LSST and the Parties
4 NOC Services
   4.1 Hours of Operations - NOC Service Desk
   4.2 NOC Services System
   4.3 Service Requests
      4.3.1 Submitting a Service Request to the LSST NET
      4.3.2 Tracking Service Requests:
   4.4 NOC System Tools and Components:
      4.4.1 Weather Map
      4.4.2 Status of Circuits
      4.4.3 Latency Charts
      4.4.4 Trouble Ticket System
      4.4.5 Network Performance Monitoring System
      4.4.6 LSST NET Website
5 Monitoring, Network Alert Notifications and Outage Management
   5.1 Real-time Network Monitoring system
   5.2 Monitoring tools
   5.3 Router Proxy
   5.4 Outage Management
6 Reporting
7 Performance and Tuning
   7.1 Data Integrity Mechanisms
   7.2 Link Integrity Mechanisms
   7.3 Network/Link Performance Issues:
8 Maintenance (per Segment)
   8.1 Maintenance Notification and Operations Calendar
9 Installation and Integration
10 Management
   10.1 Network Engineering Team
   10.2 Escalation List and Outage Notification Procedures
      10.2.1 AMPATH Escalation List
      10.2.2 REUNA Escalation List
      10.2.3 NCSA Escalation List
11 Appendix
   11.1 Parameters for Network Alert Notifications
   11.2 Measurement Instrumentation
1 Summary
This document addresses the operational relationship between parties providing Network Engineering
Team (NET) Services for LSST (known as LSST:NET). The parties are AURA/LSST Corporation (herein
referred to as LSST), Florida International University (FIU), the National Center for Supercomputing
Applications (NCSA), and Red Universitaria Nacional (REUNA).
THIS DOCUMENT IS NOT A CONTRACT OR LEGAL DOCUMENT AND IS THEREFORE NON-BINDING AS
AMONG THE PARTIES PREVIOUSLY DESCRIBED OTHER THAN AS AN ADDENDUM TO A SEPARATE
CONTRACT BETWEEN SAID PARTIES.
The overall network design is described in LSE-78 LSST Observatory Network Design. In addition, the
network is subject to the security plans and procedures described in LSE-99 LSST Cybersecurity Plan. In
case of any conflict between this document and LSE-78 or LSE-99, LSE-78 and LSE-99 shall have
precedence.
Specifically, this document aims to describe the roles of each of the Institutions supporting the transport
and security of LSST data to its archive facilities at NCSA and LSST in Tucson. The goal of this approach is
to have a centralized internally staffed LSST Network Architecture Team (NAT), supported by a set of
geographically distributed, integrated Network Operations Centers (NOCs) and engineers operating as a
single coordinated operations and engineering team for LSST – the LSST Network Engineering Team
(LSST:NET). The LSST:NET will consist of the NAT and each of the participating NOCs.
This document is aimed at describing the services required by LSST and establishing guidelines,
expectations and a general overview of services roles and responsibilities of each NOC; specifically, as
related to Maintenance (per segment), Outage Management, Performance and Tuning, Integration and
Configuration, Provisioning and Installation, Contracting and Management. As LSST approaches
operations, this plan will be expanded with detailed procedures, checklists, and other documentation in
each area.
Parties agree that unilateral amendments to this document are prohibited. Provisions materially or
substantially affecting the scope of work as herein described may not be added, altered or removed
without:
• Formal, written approval of all changes by LSST NET prior to implementation of the change(s).
• Documentation from the party providing the change, describing the foreseeable impact to the
scope of work, delivered at least 1 week before said meeting is to take place.
2 Network Services
LSST will depend upon the following network services:
• Ethernet Transport (end-to-end)
• IP routed services for access to the Commodity Internet, Internet2, NLR and other backbone
networks
• Bandwidth Management Services
• Diagnostic services and tools
2.1 Ethernet Transport Service
Ethernet is a widely adopted transport protocol to establish persistent end-to-end circuits. The R&E
networking community has adopted VLAN tagging (IEEE 802.1Q) as a methodology to extend an
Ethernet segment across multiple administrative domains. LSST data will be traversing multiple network
domains from La Serena to NCSA. The network operators responsible for each network domain shall
establish a process for the provisioning of VLANs, end-to-end, either statically or dynamically.
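As a hedged illustration of the VLAN tagging referred to above, the sketch below packs and parses the 4-byte IEEE 802.1Q tag: a 16-bit Tag Protocol Identifier (0x8100) followed by a 16-bit Tag Control Information field holding a 3-bit priority, a 1-bit drop-eligible indicator and a 12-bit VLAN ID. The function names and the VLAN ID used in the usage note are hypothetical; this plan does not assign VLAN IDs.

```python
import struct

TPID_8021Q = 0x8100  # Tag Protocol Identifier defined by IEEE 802.1Q


def build_vlan_tag(vlan_id: int, priority: int = 0, dei: int = 0) -> bytes:
    """Return the 4-byte 802.1Q tag: 16-bit TPID followed by 16-bit TCI."""
    if not 0 <= vlan_id <= 0xFFF:
        raise ValueError("VLAN ID must fit in 12 bits (0-4095)")
    if not 0 <= priority <= 7:
        raise ValueError("priority (PCP) must fit in 3 bits (0-7)")
    if dei not in (0, 1):
        raise ValueError("DEI is a single bit")
    # TCI layout: PCP (3 bits) | DEI (1 bit) | VLAN ID (12 bits)
    tci = (priority << 13) | (dei << 12) | vlan_id
    return struct.pack("!HH", TPID_8021Q, tci)


def parse_vlan_tag(tag: bytes) -> dict:
    """Split a 4-byte 802.1Q tag back into its fields."""
    tpid, tci = struct.unpack("!HH", tag)
    if tpid != TPID_8021Q:
        raise ValueError("not an 802.1Q tag")
    return {"priority": tci >> 13, "dei": (tci >> 12) & 1, "vlan_id": tci & 0xFFF}
```

For example, `build_vlan_tag(100)` produces the tag that each domain along the path would need to carry unchanged, or translate consistently, for a hypothetical end-to-end VLAN 100.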
2.2 IP routed services
IP routed services for access to the Commodity Internet, Internet2, NLR and other backbone networks
shall be the traditional methodology for Inter-domain network communications using layer 3 protocols.
LSST will need to interconnect to Commodity Internet, as well as Internet2, NLR and other R&E
backbone networks. The Layer 3 protocol that will be used between the different parties is the Border
Gateway Protocol (BGP), carrying dual-stack (IPv4 and IPv6) prefixes. Internal to each party, the routing
protocol is an internal choice that doesn’t affect the overall routing.
2.3 Bandwidth Management Services
A set of tests shall be performed on a regular basis to ensure LSST’s network requirements are satisfied.
Traffic characteristics for LSST, such as bulk data movement, rapid movement of GB sized files, etc., shall
be tested routinely to verify network requirements. Bandwidth (Wire Speed), Latency, Jitter, Packet
Rate, Maximum Payload Throughput, etc. are tests that will be performed to verify the availability of
bandwidth. The next section, 2.4, describes tests and tools for verifying bandwidth.
2.4 Diagnostic Services and Tools
LSST:NET shall establish a baseline set of diagnostic tools in order to be able to conduct a baseline set of
tests to verify the end-to-end performance and proper operation of the network. The following subsections describe several tests and tools to include in the baseline diagnostic tests for the network.
2.4.1 End-point Reachability
End-point reachability involves determining whether the end host IP address is reachable across the network.
The LSST NOC shall use a tool that provides a representation of all the end points that must be
reachable. The tool shall allow the reachability test interval to be configured. AmLight suggests an
interval of up to 5 minutes for testing end-point reachability. Suggested tools are: ping, SmokePing, Zabbix,
Nagios.
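The reachability test described above can be sketched as follows. This is a minimal illustration rather than a recommended implementation: it assumes the Linux iputils `ping` flags (`-c` for the number of echo requests, `-W` for the reply timeout in seconds), and the function names are hypothetical. The `run` parameter exists only so the check can be exercised without sending packets.

```python
import subprocess
from typing import Callable, Dict, Iterable, List

POLL_INTERVAL_SECONDS = 5 * 60  # upper bound on the test interval suggested in the text


def ping_command(host: str, count: int = 3, timeout_s: int = 2) -> List[str]:
    """Build a ping command line (Linux iputils flags are assumed)."""
    return ["ping", "-c", str(count), "-W", str(timeout_s), host]


def is_reachable(host: str, run: Callable = subprocess.run) -> bool:
    """A host counts as reachable when ping exits with status 0."""
    result = run(ping_command(host),
                 stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def check_endpoints(hosts: Iterable[str], run: Callable = subprocess.run) -> Dict[str, bool]:
    """Return a reachability flag for every end point in the list."""
    return {host: is_reachable(host, run) for host in hosts}
```

A scheduler (cron, or one of the monitoring tools listed above) would call `check_endpoints` once per `POLL_INTERVAL_SECONDS` and raise an alert for any end point reported as unreachable.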
2.4.2 Circuit Latency
Latency is the delay between the sender and the receiver, normally measured in milliseconds.
Latency is measured either one-way (the time from the source sending a
packet to the destination receiving it), or round trip (the one-way latency from source to destination
plus the one-way latency from the destination back to the source) [1].
ICMP and OWAMP are protocols used to measure latency. Tools may be configured to perform latency
tests at regular intervals. AmLight suggests 5-minute test intervals. Suggested tools are: ping, hping,
owamp, Zabbix, Nagios, perfSONAR.
[1] A more precise definition also factors in the time spent by the destination to both process the
incoming packet and to send an answer back to the source.
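The relationship between one-way and round-trip latency can be written out as a short sketch. The function names and the millisecond values in the usage note are hypothetical; the optional processing term corresponds to the more precise definition in the footnote.

```python
def round_trip_ms(forward_ms: float, reverse_ms: float, processing_ms: float = 0.0) -> float:
    """Round-trip latency: forward one-way delay plus reverse one-way delay,
    optionally including the time the destination spends handling the packet."""
    return forward_ms + reverse_ms + processing_ms


def estimated_one_way_ms(rtt_ms: float) -> float:
    """Rough one-way estimate from a round-trip measurement.

    This assumes a symmetric path, which is not guaranteed on a circuit that
    crosses several administrative domains; a true one-way measurement (for
    example with OWAMP) requires synchronized clocks at both ends.
    """
    return rtt_ms / 2.0
```

With hypothetical one-way delays of 70 ms and 75 ms, `round_trip_ms(70.0, 75.0)` gives 145.0 ms, while halving that round trip would give 72.5 ms for each direction and hide the asymmetry.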
2.4.3 Circuit Throughput
Throughput on an Ethernet circuit is the maximum rate at which none of the offered frames are
dropped by any device in the path. Bandwidth tests should be scheduled and performed regularly. Full
circuit speed tests are suggested every twelve hours (0000 and 1200, UTC-4), using 3-minute UDP tests and
6-minute TCP tests. Test durations can be altered as needed. Suggested tools are iperf, nuttcp, bwctl,
perfSONAR. Verification may be performed using SNMP through interface counters.
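Verification through interface counters amounts to reading an octet counter before and after a test and converting the difference to a rate. The sketch below assumes the 64-bit IF-MIB counters (`ifHCInOctets` / `ifHCOutOctets`) and handles a single counter wrap; the function names and sample values are hypothetical, and how the counters are polled is left to the monitoring tool.

```python
COUNTER64_MODULUS = 2 ** 64  # ifHCInOctets / ifHCOutOctets are 64-bit counters


def octet_delta(first: int, second: int, modulus: int = COUNTER64_MODULUS) -> int:
    """Octets transferred between two readings, tolerating one counter wrap."""
    return (second - first) % modulus


def throughput_bps(first_octets: int, second_octets: int, interval_s: float,
                   modulus: int = COUNTER64_MODULUS) -> float:
    """Average throughput in bits per second over the polling interval."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return octet_delta(first_octets, second_octets, modulus) * 8 / interval_s
```

For a 3-minute UDP test, a counter that advances by 225,000,000,000 octets over 180 seconds corresponds to an average of 10 Gbit/s, which can be compared against the rate reported by the test tool itself.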
3 Responsibilities
Figure 1 depicts the network segments within the LSST Observatory Network and which party is
primarily responsible for Operations and Management. Refer to LSE-78 for a complete discussion of
network segments.
The parties responsible for the Operations and Management of the network segments for LSST are
organized into the following three groups: First, the LSST Network Architecture Team (NAT). The LSST
Lead Network Engineer is a member of the NAT. Second, the LSST Network Engineering Team
(LSST:NET). Third, Participating Network Operations Centers (NOCs), constituted by AMPATH, REUNA,
and LSST.
Figure 1: LSST Network Segments
3.1 Relationships between LSST and the Parties
The segment from the Summit Site on Cerro Pachon to the Base Site in La Serena shall be contracted
and operated by LSST. The network segments from Santiago to La Serena shall be contracted and
operated by REUNA. Network segments from Santiago to Chicago shall be contracted by FIU, as part of
the AmLight project. The network segment from Chicago to Champaign shall be contracted by NCSA.
In particular, the following table defines responsibility for each segment.
Table 1 Contracting Party and Responsible NOC for each network segment
Network Segment | Contracting Party | Responsible NOC
SANTIAGO – PANAMA | FIU | AMPATH
PANAMA – LOS ANGELES | FIU | AMPATH, CENIC
SANTIAGO – SAO PAULO | FIU | AMPATH, ANSP
SAO PAULO – MIAMI | FIU | AMPATH
MIAMI – CHICAGO | FIU | AMPATH
LOS ANGELES – TUCSON | FIU | AMPATH, CENIC
TUCSON – CHICAGO | FIU | AMPATH, CENIC
CHAMPAIGN – CHICAGO | NCSA | NCSA
LA SERENA – SANTIAGO | REUNA | REUNA
CERRO PACHON – LA SERENA | LSST | LSST
4 NOC Services
The following is a list of required operational criteria that the integrated NOC must maintain at all
times.
4.1 Hours of Operations - NOC Service Desk
The LSST NOC must maintain a Service Desk, staffed by Operators and reachable at all times: 24 hours a
day, 7 days a week and 365 days a year.
4.2 NOC Services System
A system shall be identified that provides the required NOC Services for LSST. This LSST NOC Services
System shall facilitate communication and coordination among participating NOCs. All of the
participating NOCs shall implement the NOC Services System. Interoperability between the chosen LSST
NOC Services System and other NOC systems in use at participating NOCs would be highly desirable.
Each participating NOC shall utilize this system to receive network alert notifications and other event
driven information about the operational status of the network. The NOC Services system will include a
trouble ticket database, which will be integrated with the overall LSST operations trouble ticket
database. These services will be available in an integrated fashion via a portal.
4.3 Service Requests
The LSST NOC shall receive requests through service request channels. Service request channels refer to
bidirectional communication channels, such as a publicly accessible e-mail address, phone number,
website, messaging-chat system, etc. Service requests to the LSST:NET will be submitted by the LSST
Lead Network Engineer, who is the leader of the NAT. If a request arrives at the general ticket system,
and if it’s routed to the internal network engineering team queue, the Lead Network Engineer will
determine if it should be forwarded to the internal LSST NET. Requests are filtered in this way so that
not every request reaches the internal NOCs.
4.3.1 Submitting a Service Request to the LSST NET
To submit a request to the LSST NET, the channels that shall be made available are a web form, e-mail,
or voice communication.
4.3.2 Tracking Service Requests:
All service requests shall be tracked via a ticket, automatically issued from an issue tracking system, such
as JIRA [2]. All actions performed on a service request shall be recorded in the Service Request form.
4.4 NOC System Tools and Components:
The NOC System shall consist of a set of standard tools to represent the status of all the network
segments that LSST data traverses. The following is a well-known preliminary set of tools used for
reporting the status of the network and to report network problems. A web-based portal will be used
by each of the NOCs to access the system tools.
Tool | Participating NOC Resource Provider
Weather Map | AMPATH
Status of Circuits | AMPATH
Latency Charts | AMPATH
Trouble Ticket System | LSST
Network Performance Monitoring System | REUNA/LSST/AMPATH
LSST NET Website | LSST/REUNA/AMPATH
The following sections provide brief descriptions for each of the NOC System tools and components.
4.4.1 Weather Map
Network Weathermap is a network visualization tool to show network utilization on a per-segment basis
in a map form. The Weather Map tool may be used to create a map of the network topology of each of
the LSST network segments and report on their utilization. Below is an example of the weather map
tool used at AMPATH-AmLight.
[2] JIRA is developed by Atlassian. It’s a web-based application that’s currently being used by the
LSST Data Management team for bug tracking. It can be used to create a home page with links
to other tools. LSST is moving to implement products from Atlassian. JIRA could be used by
NOCs in this plan.
4.4.2 Status of Circuits
Status of circuits may be monitored by SNMP-based software with two main goals: (1) monitor whether the circuit is
operational, and (2) measure its utilization. Using SNMP, port counters of all devices in the end-to-end
path may be monitored. Using counters it is also possible to monitor errors on the interface, such as packet
errors due to link problems or MTU issues.
4.4.3 Latency Charts
To complement the status of circuits monitoring, it’s important to evaluate the latency between all
devices in the path. Tools for measuring latency were described in section 2.4.2. Below is an example of
a latency chart:
Latency charts help the NOC to proactively detect issues in the path, such as errors and protection
switching in the carriers’ networks.
4.4.4 Trouble Ticket System
A trouble ticket system is used to assign and coordinate tasks and to manage requests among a
community of users. It may be used to track events, failures and issues affecting LSST users.
4.4.5 Network Performance Monitoring System
The network performance monitoring system for LSST shall provide a collection of tools to monitor and
measure each network segment that crosses multiple network domains. Tools shall be able to gather
metrics for both passive and active measurements. Active Measurements include the following metrics:
Achievable Bandwidth, One-way delay, Layer 3 Path, Round Trip Delay. Passive Measurements include
the following metrics: Layer 1 and Layer 2 statistics (e.g., SNMP); Flow Observation (e.g., NetFlow,
sFlow).
4.4.6 LSST NET Website
A representation of the LSST network and its segments (described in section 3 and Figure 1) shall be on
the LSST Network website. This web site shall display real-time status information, such as alerts. A
weather map is a possible representation of the LSST network and its segments. The LSST NET and each
of the participating NOCs shall each display and monitor this information on the LSST NET web site.
5 Monitoring, Network Alert Notifications and Outage Management
The LSST NOC shall provide proactive monitoring of all elements of the LSST network, and shall generate
service alert alarms for a variety of network-centric services, including but not limited to:
o Interface up/down status
o Discards/Errors counters
o Interface utilization
o Non-Unicast Packet rate
o Network device CPU utilization
o IGP adjacency status (IS-IS, OSPF)
o BGP neighbor status
o BGP accepted/received prefixes
o Management reachability
o NTP status
o Laser Received Power Level
These and other alert parameters may be monitored using standard SNMP-based tools, such as Nagios,
Cacti, Zabbix, Tivoli Netview, CA, etc. One or more of these web-based network monitoring tools that
support SNMP shall be used to provide the LSST NOC with proactive monitoring functions, with an
overall view of the network, including a weathermap of the network. A more complete description of
each parameter may be found in the Appendix.
5.1 Real-time Network Monitoring system
A Real-time network monitoring system shall consist of the NOC System Tools and Components
described in section 4.4. Monitoring tools shall be configured to monitor for events of possible outage,
then to send out alert notifications to parties responsible for responding to events. The tools listed in
section 4.4 shall provide reports to show that the network links are operational and performing as
required.
5.2 Monitoring tools
A suite of tools shall be established for LSST Network Operations to monitor and test the health of the
network. Such tools may be based on current tools that support inter-domain performance monitoring,
such as perfSONAR. Appendix 11.2 contains a recommendation for the use of perfSONAR to perform at
least two very important tests: (1) end-to-end bandwidth measurement; and (2) network delay measurement.
5.3 Router Proxy
A Router Proxy provides a web-based tool interface, allowing users to query a router through the web
interface instead of having to be directly connected to the device. A router proxy is commonly used for
troubleshooting, such as to check the IP route advertisement, BGP path, traceroute, ping, and other
commands a router is permitted to execute via the proxy. Output from the router is then displayed in
the web interface.
There are different implementations for router proxies. Router proxies may be implemented with public
access, or private, limiting access to directly connected users. Typically, it is the NOC administrator who
decides which commands may be executed in a router proxy. For example, NOC administrators may
limit execution to a single command, such as a traceroute.
Router proxies currently operating are the following:
• AMPATH: http://routerproxy.grnoc.iu.edu/ampath/
• Internet2: provides router proxy services for its IP network at
http://routerproxy.grnoc.iu.edu/internet2/, and its AL2S 100G network at
http://routerproxy.grnoc.iu.edu/al2s/
• ESNET Router Server: http://traceroute.es.net/cgi-bin/trace
• REUNA has a private router proxy service that requires credentials to access. REUNA has offered to
include the AURA border router within the scope of its router proxy. This will permit the AURA
border router to be queried.
5.4 Outage Management
Outages are events that result in loss of services. The monitoring tools and alert notifications described
previously are used to detect outages and the events that caused them. Events that result in outages
normally are fiber cable cuts, failure of active equipment, operator error, etc. Monitoring tools track
events. Events signal possible outages.
Outage Management functions:
• Localization of the outage.
• Extent of the outage: determination, dissemination of information.
• Restoration efforts: information gathering, dissemination, coordination.
Participating NOCs and engineering teams will use an internal trouble ticket system to track network
issues and coordinate among themselves. Information shall be disseminated to upstream providers or
downstream organizations affected by the outage via e-mail or phone calls, and updates shall be provided to
network engineers as reasonably needed.
Outage management/monitoring tools must report alert notifications to guide the LSST:NET Team in the
resolution of outages and restoration of services. When an alert notification occurs, the LSST:NET shall
perform a procedure that corresponds to the alert. Alarms should be seen by all NET members. The
NOC responsible for handling the event immediately starts executing a corrective action procedure.
Table 2 below provides a non-exhaustive list of alert notification messages and a description of the
corresponding procedure.
Table 2 Alert Notification with corresponding Procedure
Alert Notification Message | Procedure to be executed
Link DOWN/UP | Identify the segment; responsible NOC should open a ticket with the responsible Carrier or colocation facility.
BGP Session DOWN/UP | Identify the BGP router; responsible NOCs should identify the cause and restore the session.
High CPU Utilization | Identify the router; responsible NOC should identify the cause and solve the problem.
Date/Time Not Synchronized | Identify the NTP server; responsible NOC should identify the cause and fix the synchronization.
BGP Received Prefixes has changed over 20-50% | Identify the BGP router; responsible NOCs should work together to understand what happened and decide whether an action should be taken.
Discards/Errors detected on Interface | Identify the router; responsible NOC should contact the carrier and/or the colocation facility.
Delay/RTT has changed more than 20-50% | Identify which segment had its delay increased; responsible NOC should call the Carrier.
Received Power Laser Level is too low (<15 dBm) | Identify the port and the router; call the Colo Facility responsible for the cross-connection. It may also be necessary to call the carrier.
A probe or a perfSONAR server is not responding | Identify the device; responsible NOC should work to fix the problem.
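Two of the alerts in Table 2 are defined by a relative change against a baseline (received BGP prefixes and delay/RTT). A minimal sketch of such a check is shown below. Table 2 gives a range of 20-50% rather than a single threshold, so the default of 20% here is an assumption, as are the function names and the sample values in the usage note.

```python
def percent_change(baseline: float, current: float) -> float:
    """Absolute change of the current value relative to the baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return abs(current - baseline) * 100.0 / baseline


def exceeds_threshold(baseline: float, current: float, threshold_pct: float = 20.0) -> bool:
    """True when the change is large enough to raise an alert notification.

    The 20% default is the lower bound of the 20-50% range in Table 2; the
    operating threshold would be agreed by the participating NOCs.
    """
    return percent_change(baseline, current) > threshold_pct
```

For example, a round-trip time that moves from a hypothetical baseline of 140 ms to 175 ms is a 25% change: it would trigger the alert at a 20% threshold but not at 50%.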
6 Reporting
Monthly, quarterly, and annual reports shall be generated routinely to reflect outage and maintenance activity,
network availability statistics, and general trouble ticket analysis. Report generation shall be performed
by the LSST Lead Network Engineer.
Reports shall be made available on the LSST NET website under “Support” or “Reports”. These reports
shall also be used for a general overview of LSST network operations and engineering during
regularly scheduled NOC Operations calls.
7 Performance and Tuning
A performance monitoring system will be deployed to measure the performance of the end-to-end link
from La Serena to NCSA. The performance monitoring tool can be used to verify the performance of the
end-to-end link. In particular, link integrity is an important test when data corruption is reported or
suspected. The NET shall provide a procedure for users to test the performance of the link and to verify
that the network is performing as expected, and not the cause of a data integrity problem.
7.1 Data Integrity Mechanisms
Checksums on all data transfers shall be checked at the application layer, and are therefore outside the
scope of this document.
7.2 Link Integrity Mechanisms
Active monitoring procedures should be performed by the network monitoring tools to test link
integrity. Section 5 has defined some specific items that should be monitored in order to provide
information about the end-to-end link quality, from the layer 1 to layer 3 perspectives. For large data
movements, however, it is sometimes important to simulate a data transfer to validate the circuit capacity and
the end-to-end link integrity from the application point of view.
One approach is to implement an active monitoring procedure. Such a procedure may
generate a large data set and its hash, and then send the data to the remote end
host. The remote host then computes its own hash of the received data for comparison. If the two hashes match,
the link delivered the data intact. It is important to generate a data set large enough to use the
whole bandwidth for more than one minute, to force the full utilization of the link. To accomplish this
test, perfSONAR could be used. In case of poor performance, perfSONAR will help the LSST:NET to isolate
the problem, as described in section 7.3.
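The hash comparison described above can be sketched as follows. The plan does not name a hash function or a link rate, so SHA-256 and the 10 Gbit/s figure in the usage note are assumptions, as are the function names. The data is hashed in chunks so that a multi-gigabyte test set never has to be held in memory.

```python
import hashlib
from typing import Iterable


def required_bytes(link_bps: int, seconds: int) -> int:
    """Size of the data set needed to occupy the full link for the given time."""
    return link_bps * seconds // 8


def sha256_of_chunks(chunks: Iterable[bytes]) -> str:
    """Hash a stream of chunks without loading the whole data set into memory."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def transfer_is_intact(sent_digest: str, received_chunks: Iterable[bytes]) -> bool:
    """Compare the sender's hash with a hash computed at the remote host."""
    return sha256_of_chunks(received_chunks) == sent_digest
```

On an assumed 10 Gbit/s circuit, `required_bytes(10_000_000_000, 60)` is 75,000,000,000 bytes, so a test set somewhat larger than 75 GB would be needed to keep the link full for more than one minute.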
7.3 Network/Link Performance Issues:
If the link has a physical cut in some segment, then it is normally straightforward to locate the
problem. However, if the link is not cut but its performance is degraded, identifying
where the problem lies becomes more complicated. To detect degraded link performance,
it is necessary to implement tools that gather physical indicators from the links, so that
the location of a link failure along the path can be identified in the least time possible. Some of these
indicators are:
• Status of circuits
• Latency Charts
• Segment by segment analysis
Poor performance may also be a result of the following factors: (a) client and server end hosts are not
properly tuned; (b) buffer size issues in hosts and network equipment; (c) misconfiguration of network
switches or routers; (d) firewall performance issues; (e) poor choice of file transfer tool. LSST:NET shall
adopt best practices and establish a common set of tools to conduct network performance tests.
One suggestion is to deploy perfSonar servers at each participating party so that segment-by-segment
tests can be performed. With this approach, it would be easy to detect where on the network the poor
performance occurs. If test results show that the network performance is within specified
requirements, yet the application is not performing as expected, then the LSST Lead Network Engineer
will work with the liaison of the applications team to further explore the issue and to coordinate
resources towards resolution.
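A segment-by-segment comparison can be sketched as below. This is only an illustration: the segment names, the throughput figures, and the function name are hypothetical, and the measurements are assumed to come from tests between adjacent perfSonar nodes.

```python
from typing import Dict, List

def degraded_segments(measured_mbps: Dict[str, float],
                      required_mbps: float) -> List[str]:
    """Return the segments whose measured throughput is below the requirement."""
    return [name for name, rate in measured_mbps.items()
            if rate < required_mbps]

# Hypothetical test results, in Mbps, for three illustrative segments.
results = {
    "La Serena - Santiago": 9400.0,
    "Santiago - Miami": 3100.0,
    "Miami - NCSA": 9200.0,
}
suspects = degraded_segments(results, required_mbps=9000.0)
# suspects == ["Santiago - Miami"]
```

If the list is empty while the application still underperforms, the problem is more likely in the end hosts or the application, which is the case handed to the applications team liaison.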
8 Maintenance (per Segment)
Maintenance of the network segments must be scheduled and users must be notified with ample
advance notice; for example, 48 hours. The LSST:NET shall announce a periodic maintenance window to
perform routine maintenance on the network.
8.1 Maintenance Notification and Operations Calendar
LSST:NET shall provide scheduled maintenance notifications to each participating NOC. Notifications
will be sent either (a) to a broadcast mailing list, composed of e-mail addresses provided by LSST:NET,
for any segments involved in a scheduled maintenance window; or (b) to targeted individual entities,
which shall be notified of outages and maintenance affecting only their network connectivity.
Scheduled maintenance is further tracked via the ticketing system previously described and
monitored by the LSST NOC Service Desk staff. Maintenance may also be subject to Change
Management approvals.
The LSST NOC shall maintain Operations Calendars available via the Web. These calendars include
Scheduled and Unscheduled Outages as well as Service Requests.
9 Installation and Integration
LSST will purchase services from the participating NOCs for the installation and maintenance of the
necessary equipment.
10 Management
10.1 Network Engineering Team
Engineers would be located as follows:
 1 Tucson (Lead Network Engineer, also on Network Architecture Team)
 1/2 NCSA (also on Network Architecture Team)
 1/2 AMPATH
 1/2 REUNA
 1 La Serena
10.2 Escalation List and Outage Notification Procedures
During a service outage, an escalation procedure is executed to notify the people on the escalation
list. The escalation procedure should contain information to identify the severity level and what
action to take. The escalation procedure should state which NOC is responsible for each segment.
LSST:NET shall define the escalation procedure and escalation list for each segment of the network.
LSST:NET shall keep this list current and accessible via a secure wiki/website.
10.2.1 AMPATH Escalation List
AMPATH NOC (hosted at Indiana University).
AMPATH Network Engineer
LSST Network Lead Engineer
AMPATH Director
10.2.2 REUNA Escalation List
REUNA NOC
REUNA Network Engineer
LSST Network Lead Engineer
REUNA Director
10.2.3 NCSA Escalation List
NCSA Network Engineer
LSST Network Lead Engineer
NCSA Director
11 Appendix
11.1 Parameters for Network Alert Notifications
The following is a list of well-known parameters in the SNMP network management information base.
These parameters will be used to represent the status of the network to operators, network engineers,
and users using a common standards-based tool.
Interface up/down status
 Using the SNMP object “IF-MIB::ifOperStatus”, which reports the operational status of an
interface (UP or DOWN), all interfaces of the LSST network should be monitored. This
object may also be used to measure the overall availability of the network.
 Each time an SNMP query returns a DOWN value, an alert must be sent to LSST:NET. It is
recommended to wait at least 30-60 seconds before contacting the network carrier, to be
sure the event is not just a flap. Repeated flaps must not be tolerated and should be
investigated.
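The hold-down logic described above can be sketched as follows. This is a hedged illustration, not part of the plan: the function names and the 60 second default are assumptions, and the samples are assumed to be (timestamp, ifOperStatus) pairs already collected by an SNMP poller. The numeric values up(1) and down(2) are those defined for ifOperStatus in IF-MIB.

```python
from typing import List, Tuple

UP, DOWN = 1, 2  # ifOperStatus values defined by IF-MIB: up(1), down(2)

def should_escalate(samples: List[Tuple[float, int]],
                    hold_seconds: float = 60.0) -> bool:
    """True once the interface has stayed DOWN for at least hold_seconds.

    samples are (timestamp_in_seconds, ifOperStatus) pairs in time order.
    Any status other than DOWN resets the hold-down timer.
    """
    down_since = None
    for timestamp, status in samples:
        if status == DOWN:
            if down_since is None:
                down_since = timestamp
            if timestamp - down_since >= hold_seconds:
                return True
        else:
            down_since = None
    return False

def count_flaps(samples: List[Tuple[float, int]]) -> int:
    """Count UP-to-DOWN transitions, to spot an interface that keeps flapping."""
    flaps = 0
    previous = None
    for _, status in samples:
        if previous == UP and status == DOWN:
            flaps += 1
        previous = status
    return flaps
```

In this sketch the alert to LSST:NET would still be raised on the first DOWN sample; only the call to the carrier waits for the hold-down period.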
Discards/Errors counters
 One factor that affects the performance of the network is the number of errors and/or
discards occurring along the path. Errors may be caused by CRC failures, duplex
negotiation issues, or invalid packet lengths, while discards may reflect an over-utilized
interface that is dropping packets because it is receiving more traffic than it can
support.
 It is important to measure these interface counters, and the SNMP objects
“IF-MIB::ifInErrors” and “IF-MIB::ifOutErrors” are the best way to do it.
 These counters must always remain at zero.
Interface utilization
 The bandwidth utilization of each interface should be measured at least every 5
minutes, since bursts can happen and compromise the performance of all
applications.
 It is possible to measure this utilization using the SNMP objects “IF-MIB::ifInOctets”
(incoming traffic) and “IF-MIB::ifOutOctets” (outbound traffic).
 Each time the threshold of 80% is reached, an alert must be sent to LSST:NET.
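Utilization is derived from the difference between two octet counter readings. The sketch below is illustrative only: the function names are assumptions, and it handles at most one counter wrap between polls. On fast links the 32-bit counters wrap quickly (2^32 octets take about 3.4 seconds at 10 Gbps), so the 64-bit IF-MIB counters “ifHCInOctets” and “ifHCOutOctets” are the safer choice for a 5 minute polling interval.

```python
def counter_delta(previous: int, current: int, bits: int = 32) -> int:
    """Difference between two readings of a wrapping SNMP counter.

    Correct only if the counter wrapped at most once between the readings.
    """
    return (current - previous) % (1 << bits)

def utilization_percent(prev_octets: int, curr_octets: int,
                        interval_s: float, link_bps: float,
                        bits: int = 32) -> float:
    """Average utilization of the link over the polling interval, in percent."""
    octets = counter_delta(prev_octets, curr_octets, bits)
    return 100.0 * octets * 8 / (interval_s * link_bps)

def over_threshold(percent: float, threshold: float = 80.0) -> bool:
    """True when the 80% alert threshold from this section is reached."""
    return percent >= threshold
```

For example, 300,000,000,000 octets counted over 300 seconds on a 10 Gbps interface is an average of 8 Gbps, which is exactly the 80% threshold.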
Non-Unicast Packet rate
 Broadcast storms are a very common and dangerous problem in Layer 2-based
networks. It is important to define a non-unicast packet rate limit to filter at all edge
interfaces, but even this control may not be enough. It is important to monitor this
packet rate to prevent and understand issues such as high CPU utilization and poor
performance.
 It is possible to measure this rate using the SNMP object “IF-MIB::ifInNUcastPkts”.
 As the SNMP object used to monitor Non-Unicast Packets includes Multicast Packets, the
network profile must be understood before implementing any alert. It is important
to mention that this counter is per interface, not per VLAN. To count VLAN Non-Unicast
traffic, a different approach should be deployed.
Network device CPU utilization
 The network device CPU is used for control plane activities, such as routing and
management. Broadcast storms, DoS attacks and interface flaps can increase the CPU
utilization to the point where network adjacencies are torn down, causing network
instability. It is therefore important to monitor this utilization.
 The network device CPU utilization is not part of the standard IF-MIB, so each
network administrator will have to find out which MIB and OID apply to their network
device. For example, when using Juniper devices, it is possible to measure CPU
utilization with the MIB object “jnxOperatingCPU”. For Cisco it is possible to use the
CISCO-PROCESS-MIB. All vendors have their MIB documentation available on their web
sites.
 Each time the threshold of 80% is exceeded for more than 30 seconds, an alert must
be sent to LSST:NET.
IGP adjacency status (IS-IS, OSPF)
 Every time an IGP adjacency is lost, an alert must be sent to the LSST:NET
BGP neighbor status
 As the LSST network will depend on Layer 3 connectivity to reach Internet2 and other
NRENs, BGP will have an important role in the LSST project. All BGP sessions
must therefore be monitored.
 The BGP neighbor status is also a vendor implementation; it is not part of the
standard IF-MIB. All major network vendors have a MIB to monitor the BGP neighbor
status. Each network administrator will have to find out which MIB and OID apply to
their network device.
 Each time an SNMP query returns a DOWN value, an alert must be sent to
LSST:NET.
BGP accepted/received prefixes
 Once the BGP session is established, some prefixes are advertised to
the BGP neighbor, and some prefixes are received from it. When BGP
policies are changed, prefixes may be added to or removed from the BGP
tables, creating network instability. For this reason, it is important to monitor the
number of received and advertised prefixes.
 The SNMP objects for these counters are part of the same MIB used to
monitor the BGP neighbor status. Vendor documentation must be checked.
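One way to act on the prefix counters is to compare each reading with a known-good baseline. The sketch below is an assumption-laden illustration: this plan does not define a threshold, so the 10% tolerance, the function name, and the example counts are all hypothetical.

```python
def prefix_count_alert(baseline: int, current: int,
                       tolerance: float = 0.10) -> bool:
    """True when the prefix count deviates from the baseline by more than
    the given fraction (10% here is a hypothetical value, not a requirement)."""
    if baseline == 0:
        # With no prefixes expected, any received prefix is a deviation.
        return current != 0
    return abs(current - baseline) / baseline > tolerance

# Hypothetical readings for one BGP neighbor with a baseline of 100 prefixes.
prefix_count_alert(100, 105)  # False: within tolerance
prefix_count_alert(100, 80)   # True: 20% of the prefixes disappeared
```

The baseline should be refreshed after each approved policy change, so that intended changes do not keep raising alerts.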
Management reachability
 It’s possible to monitor management reachability using simple ICMP Echo tests.
NTP status
 In a network where one-way delay is important, all network devices must have NTP
servers configured, to make sure that the system clock is accurate.
 On the servers connected to an external time source (a GPS or CDMA reference clock,
which is Stratum 0), it is possible to monitor the drift using NMS agents or an SNMP MIB.
These servers operate at Stratum 1.
 All network devices will synchronize to these Stratum 1 servers and so operate at
Stratum 2. To monitor these devices it is also possible to use SNMP, but the MIB is part
of the vendor implementation. Major vendors support this monitoring, and their
documentation must be checked. Any time an NTP server is not synchronized, an alert
must be sent to LSST:NET.
11.2 Measurement Instrumentation
Measurement instrumentation shall be used in each network segment. perfSonar (see footnote 3)
is a network performance monitoring system used mostly by academic networks to identify
end-to-end performance problems on paths crossing several networks. It provides a measurement
framework to which other tools may be added to increase its functionality. As it supports
several different measurements, it is recommended that bandwidth and delay tests be
configured on different nodes, since bandwidth tests could interfere with the delay test results.
Taking this recommendation into consideration, two approaches might be used:
1) Two perfSonar nodes: one for bandwidth and end-to-end tests and one for delay (RTT
and one-way) tests. With this approach, it is recommended that the second
perfSonar node have an external clock source, such as a GPS or CDMA device. This
approach would allow the NOCs to measure one-way delay very accurately;
2) One perfSonar node and one delay node. With this approach, the perfSonar node would be
used for bandwidth and end-to-end tests, while the second node would be responsible
for the delay (RTT) tests. This second node could be any device that supports ICMP and
Traceroute tests. The ATLAS RIPE probe is suggested for the delay test.
The first option has the advantage of supporting one-way delay, but it requires two servers,
more rack space to install the second server, and access to the facility's roof to install
the GPS device, which may be impractical.
The second option is less expensive, since a single server is required instead of two. The ATLAS
RIPE probe has proven to be a very reliable solution. It does not support one-way delay, but
this measurement is not a requirement for this plan. AmLight has been testing the ATLAS RIPE
probe and recommends this approach. In the future, the solution should be re-evaluated to
decide if it needs to be replaced or complemented.
To be effective, as mentioned before, all networks in the path will need to support bandwidth
and delay tests. Below is the list of the current status of this deployment for each network:
1) AURA/Cerro Pachon: It will need a perfSonar node. Hardware is available. Installation of
the perfSonar is required. ATLAS probe to be installed.
2) AURA/La Serena: perfSonar node and ATLAS probe already operational.
3) AmLight/Santiago: Neither a perfSonar node nor an ATLAS probe is installed in the Level3
facilities in Santiago (where the international links are received).
4) REUNA/Santiago: perfSonar node and ATLAS probe already operational.
5) ANSP/Sao Paulo: perfSonar hardware and ATLAS probe already operational.
6) AMPATH/Miami: perfSonar node and ATLAS probe are already operational.
7) FLR: No information about their deployment.
8) Internet2: Internet2 has perfSonar servers available in all their points of presence.
9) Starlight: No information about their deployment.
10) NCSA: There is a perfSonar node in Urbana-Champaign.
3 http://www.perfsonar.net/start.html
Below is the proposed topology for this measurement (pS = perfSonar, A = Atlas Probe). This
figure represents the recommendation described in this plan to use perfSonar nodes and Atlas
probes in each of the networks participating in the LSST Network Operations and Management Plan.
Figure 2 Topology of perfSonar nodes and ATLAS probes