
Large Synoptic Survey Telescope (LSST)

LSST Network Operations and Management Plan

Julio Ibarra, Chip Cox, Sandra Jaque, Ron Lambert, Jeff Kantor, Jeronimo Aguiar, James Grace, Mike Freemon, Tim Boerner, Dave Wheeler, Albert Astudillo

Document-11918
Latest Revision Date: 5/4/2017 5:04:00 PM

Change Record

Version | Date | Description | Owner name
1.0 | 8/3/11 | |
1.1 | 6/12/13 | |
1.2 | 7/12/13 | |
1.3 | 7/24/13 | |
1.4 | 7/26/13 | Sections 4.4.1, 4.4.2, and 4.4.3 added | Jeronimo Aguiar
1.5 | 8/9/13 | Contributions on sections 2.1, 2.2, 2.4 and 4.4 | Sandra Jaque
1.5 | 8/9/13 | Updates to sections 2, 3, 4, 5 | Chip Cox
1.5 | 8/22/13 | Updates to sections 2, 5 and 8 | Julio Ibarra
1.6 | 8/23/13 | |
1.7 | 9/6/13 | |
1.8 | 9/8/13 | |
1.9 | 9/20/13 | Revisions to sections 3, 4.2, 4.3.1, 4.4, 5 | Julio Ibarra
1.10 | 10/16/13 | Appendix 12.2 Measurement Instrumentation | Jeronimo Aguiar
1.11 | 11/8/13 | Provided comments to various sections | Jeronimo Aguiar
1.12 | 11/15/13 | Update to section 5.3 | Sandra Jaque
1.13 | 11/21/13 | Incorporated Jeronimo’s comments; section 4.4.6 | Julio Ibarra
1.14 | 11/22/13 | Updates to section 7.3; merged part of section 9 to section 1; replaced Figure 1. | Julio Ibarra
1.14 | 11/26/13 | Updates to section 7 and several other sections | Jeronimo Aguiar

Unaligned descriptions (source order): Initial draft. Revised initial draft with Long-Haul Network review recommendations. Added comments and prepared document for review by working group. Replaced Figure 1. Modified version 1.2 using notes from the last call. Incorporated contributions from Sandra and Chip. Section 2.4 Section 5, and Appendix on parameters for alert notification. Added comments with assignments from the call on 9/6.
Unaligned owner names (source order): Julio Ibarra, Jeff Kantor, Julio Ibarra, Julio Ibarra, Julio Ibarra, Jeronimo Aguiar, Julio Ibarra.

Table of Contents

Change Record
1 Summary
2 Network Services
  2.1 Ethernet Transport Service
  2.2 IP routed services
  2.3 Bandwidth Management Services
  2.4 Diagnostic Services and Tools
    2.4.1 End-point Reachability
    2.4.2 Circuit Latency
    2.4.3 Circuit Throughput
3 Responsibilities
  3.1 Relationships between LSST and the Parties
4 NOC Services
  4.1 Hours of Operations - NOC Service Desk
  4.2 NOC Services System
  4.3 Service Requests
    4.3.1 Submitting a Service Request to the LSST NET
    4.3.2 Tracking Service Requests
  4.4 NOC System Tools and Components
    4.4.1 Weather Map
    4.4.2 Status of Circuits
    4.4.3 Latency Charts
    4.4.4 Trouble Ticket System
    4.4.5 Network Performance Monitoring System
    4.4.6 LSST NET Website
5 Monitoring, Network Alert Notifications and Outage Management
  5.1 Real-time Network Monitoring system
  5.2 Monitoring tools
  5.3 Router Proxy
  5.4 Outage Management
6 Reporting
7 Performance and Tuning
  7.1 Data Integrity Mechanisms
  7.2 Link Integrity Mechanisms
  7.3 Network/Link Performance Issues
8 Maintenance (per Segment)
  8.1 Maintenance Notification and Operations Calendar
9 Installation and Integration
10 Management
  10.1 Network Engineering Team
  10.2 Escalation List and Outage Notification Procedures
    10.2.1 AMPATH Escalation List
    10.2.2 REUNA Escalation List
    10.2.3 NCSA Escalation List
11 Appendix
  11.1 Parameters for Network Alert Notifications
  11.2 Measurement Instrumentation

1 Summary

This document addresses the operational relationship between parties providing Network Engineering Team (NET) Services for LSST (known as LSST:NET). The parties are AURA/LSST Corporation (herein referred to as LSST), Florida International University (FIU), the National Center for Supercomputing Applications (NCSA), and Red Universitaria Nacional (REUNA).

THIS DOCUMENT IS NOT A CONTRACT OR LEGAL DOCUMENT AND IS THEREFORE NON-BINDING AS AMONG THE PARTIES PREVIOUSLY DESCRIBED OTHER THAN AS AN ADDENDUM TO A SEPARATE CONTRACT BETWEEN SAID PARTIES.

The overall network design is described in LSE-78 LSST Observatory Network Design. In addition, the network is subject to the security plans and procedures described in LSE-99 LSST Cybersecurity Plan. In case of any conflict between this document and LSE-78 or LSE-99, LSE-78 and LSE-99 shall have precedence.

Specifically, this document aims to describe the roles of each of the Institutions supporting the transport and security of LSST data to its archive facilities at NCSA and LSST in Tucson.
The goal of this approach is to have a centralized, internally staffed LSST Network Architecture Team (NAT), supported by a set of geographically distributed, integrated Network Operations Centers (NOCs) and engineers operating as a single coordinated operations and engineering team for LSST: the LSST Network Engineering Team (LSST:NET). The LSST:NET will consist of the NAT and each of the participating NOCs.

This document describes the services required by LSST and establishes guidelines, expectations, and a general overview of the service roles and responsibilities of each NOC, specifically as related to Maintenance (per segment), Outage Management, Performance and Tuning, Integration and Configuration, Provisioning and Installation, Contracting and Management. As LSST approaches operations, this plan will be expanded with detailed procedures, checklists, and other documentation in each area.

Parties agree that unilateral amendments to this document are prohibited. Provisions materially or substantially affecting the scope of work as herein described may not be added, altered or removed without:
• Formal, written approval of all changes by LSST NET prior to implementation of the change(s).
• Documentation, provided by the party proposing the change, describing the foreseeable impact to the scope of work at least 1 week before said meeting is to take place.

2 Network Services

LSST will depend upon the following network services:
• Ethernet Transport (end-to-end)
• IP routed services for access to the Commodity Internet, Internet2, NLR and other backbone networks
• Bandwidth Management Services
• Diagnostic services and tools

2.1 Ethernet Transport Service

Ethernet is a widely adopted transport protocol for establishing persistent end-to-end circuits. The R&E networking community has adopted VLAN tagging (IEEE 802.1Q) as a method to extend an Ethernet segment across multiple administrative domains.
LSST data will traverse multiple network domains from La Serena to NCSA. The network operators responsible for each network domain shall establish a process for provisioning VLANs end-to-end, either statically or dynamically.

2.2 IP routed services

IP routed services for access to the Commodity Internet, Internet2, NLR and other backbone networks shall be the traditional method for inter-domain network communications using Layer 3 protocols. LSST will need to interconnect to the Commodity Internet, as well as to Internet2, NLR and other R&E backbone networks. The Layer 3 protocol used between the parties will be the Border Gateway Protocol (BGP), with dual-stack (IPv4 and IPv6) prefixes. Internal to each party, the routing protocol is a local choice that does not affect the overall routing.

2.3 Bandwidth Management Services

A set of tests shall be performed on a regular basis to ensure LSST’s network requirements are satisfied. Traffic characteristics for LSST, such as bulk data movement and rapid movement of GB-sized files, shall be tested routinely to verify network requirements. The tests performed to verify the availability of bandwidth include Bandwidth (Wire Speed), Latency, Jitter, Packet Rate, and Maximum Payload Throughput. Section 2.4 describes tests and tools for verifying bandwidth.

2.4 Diagnostic Services and Tools

LSST:NET shall establish a baseline set of diagnostic tools and tests to verify the end-to-end performance and proper operation of the network. The following subsections describe several tests and tools to include in the baseline.

2.4.1 End-point Reachability

End-point reachability testing determines whether an end host’s IP address is reachable across the network. The LSST NOC shall use a tool that provides a representation of all the end points that must be reachable.
The tool shall be configurable to set intervals for testing reachability. AmLight suggests an interval of up to 5 minutes for testing end-point reachability. Suggested tools are: ping, smokeping, zabbix, nagios.

2.4.2 Circuit Latency

Latency is the delay between the sender and the receiver, normally measured in milliseconds. Latency is measured either one-way (the time from the source sending a packet to the destination receiving it) or round trip (the one-way latency from source to destination plus the one-way latency from the destination back to the source) [1]. ICMP and OWAMP are protocols used to measure latency. Tools may be configured to perform latency tests at regular intervals. AmLight suggests 5-minute test intervals. Suggested tools are: ping, hping, owamp, zabbix, nagios, perfSONAR.

[1] A more precise definition also factors in the time spent by the destination to both process the incoming packet and to send an answer back to the source.

2.4.3 Circuit Throughput

Throughput on an Ethernet circuit is the maximum rate at which none of the offered frames are dropped by any device in the path. Bandwidth tests should be scheduled and performed regularly. Full circuit speed tests are suggested every twelve hours (0000, 1200 UTC-4), using 3-minute UDP tests and 6-minute TCP tests. Test durations can be altered as needed. Suggested tools are: iperf, nuttcp, bwctl, perfSONAR. Results may be verified using SNMP interface counters.

3 Responsibilities

Figure 1 depicts the network segments within the LSST Observatory Network and which party is primarily responsible for Operations and Management. Refer to LSE-78 for a complete discussion of network segments.

The parties responsible for the Operations and Management of the network segments for LSST are organized into the following three groups:

First, the LSST Network Architecture Team (NAT). The LSST Lead Network Engineer is a member of the NAT.
Second, the LSST Network Engineering Team (LSST:NET).

Third, Participating Network Operations Centers (NOCs), constituted by AMPATH, REUNA, and LSST.

Figure 1: LSST Network Segments

3.1 Relationships between LSST and the Parties

The segment from the Summit Site on Cerro Pachon to the Base Site in La Serena shall be contracted and operated by LSST. The network segments from Santiago to La Serena shall be contracted and operated by REUNA. Network segments from Santiago to Chicago shall be contracted by FIU, as part of the AmLight project. The network segment from Chicago to Champaign shall be contracted by NCSA. In particular, the following table defines responsibility for each segment.

Table 1 Contracting Party and Responsible NOC for each network segment

Network Segment | Contracting Party | Responsible NOC
Santiago – Panama | FIU | AMPATH
Panama – Los Angeles | FIU | AMPATH, CENIC
Santiago – Sao Paulo | FIU | AMPATH, ANSP
Sao Paulo – Miami | FIU | AMPATH
Miami – Chicago | FIU | AMPATH
Los Angeles – Tucson | FIU | AMPATH, CENIC
Tucson – Chicago | FIU | AMPATH, CENIC
Champaign – Chicago | NCSA | NCSA
La Serena – Santiago | REUNA | REUNA
Cerro Pachon – La Serena | LSST | LSST

4 NOC Services

The following is a list of required operational criteria that the integrated NOC must maintain at all times.

4.1 Hours of Operations - NOC Service Desk

The LSST NOC must maintain a Service Desk, staffed by Operators and reachable at all times: 24 hours a day, 7 days a week, 365 days a year.

4.2 NOC Services System

A system shall be identified that provides the required NOC Services for LSST. This LSST NOC Services System shall facilitate communication and coordination among participating NOCs. All of the participating NOCs shall implement the NOC Services System. Interoperability between the chosen LSST NOC Services System and other NOC systems in use at participating NOCs would be highly desirable.
Each participating NOC shall utilize this system to receive network alert notifications and other event-driven information about the operational status of the network. The NOC Services System will include a trouble ticket database, which will be integrated with the overall LSST operations trouble ticket database. These services will be available in an integrated fashion via a portal.

4.3 Service Requests

The LSST NOC shall receive requests through service request channels. Service request channels are bidirectional communication channels, such as a publicly accessible e-mail address, phone number, website, or messaging/chat system. Service requests to the LSST:NET will be submitted by the LSST Lead Network Engineer, who is the leader of the NAT. If a request arrives at the general ticket system and is routed to the internal network engineering team queue, the Lead Network Engineer will determine whether it should be forwarded to the internal LSST NET. This filtering ensures that not all requests go to the internal NOCs.

4.3.1 Submitting a Service Request to the LSST NET

The channels that shall be made available for submitting a request to the LSST NET are a web form, email, and voice communications.

4.3.2 Tracking Service Requests

All service requests shall be tracked via a ticket, automatically issued from an issue tracking system, such as JIRA [2]. All actions performed on a service request shall be recorded in the Service Request form.

4.4 NOC System Tools and Components

The NOC System shall consist of a set of standard tools to represent the status of all the network segments that LSST data traverses. The following is a preliminary set of well-known tools used for reporting the status of the network and for reporting network problems. A web-based portal will be used by each of the NOCs to access the system tools.
Tools | Participating NOC Resource Provider
Weather Map | AMPATH
Status of Circuits | AMPATH
Latency Charts | AMPATH
Trouble Ticket System | LSST
Network Performance Monitoring System | REUNA/LSST/AMPATH
LSST NET Website | LSST/REUNA/AMPATH

The following sections provide brief descriptions for each of the NOC System tools and components.

4.4.1 Weather Map

Network Weathermap is a network visualization tool that shows network utilization on a per-segment basis in map form. The Weather Map tool may be used to create a map of the network topology of each of the LSST network segments and report on their utilization. Below is an example of the weather map tool used at AMPATH-AmLight.

[2] JIRA is developed by Atlassian. It is a web-based application currently used by the LSST Data Management team for bug tracking. It can be used to create a home page with links to other tools. LSST is moving to implement products from Atlassian. JIRA could be used by NOCs in this plan.

4.4.2 Status of Circuits

The status of circuits may be monitored by SNMP software with two main goals: (1) to monitor whether the circuit is operational, and (2) to measure its utilization. Using SNMP, the port counters of all devices in the end-to-end path may be monitored. Counters also make it possible to monitor errors on an interface, such as packet errors due to link problems or MTU issues.

4.4.3 Latency Charts

To complement the monitoring of circuit status, it is important to evaluate the latency between all devices in the path. Tools for measuring latency were described in section 2.4.2. Below is an example of a latency chart:

Latency charts help the NOC to proactively detect issues in the path, such as errors and protection switching within carrier networks.

4.4.4 Trouble Ticket System

A trouble ticket system is used to assign and coordinate tasks and to manage requests among a community of users. It may be used to track events, failures and issues affecting LSST users.
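The utilization measurement described in section 4.4.2 can be illustrated with a short sketch. It assumes two samples of an interface octet counter (for example IF-MIB::ifInOctets) taken a known number of seconds apart, a counter that wraps at most once between samples, and a known link speed. The function name and the sample values are hypothetical and are not part of any LSST tool.

```python
def utilization_percent(prev_octets, curr_octets, interval_s, link_bps,
                        counter_bits=32):
    """Return average link utilization (%) between two octet-counter samples."""
    if interval_s <= 0 or link_bps <= 0:
        raise ValueError("interval_s and link_bps must be positive")
    modulus = 2 ** counter_bits
    # Modular subtraction handles a single counter wrap between samples.
    delta_octets = (curr_octets - prev_octets) % modulus
    bits_per_second = delta_octets * 8 / interval_s
    return 100.0 * bits_per_second / link_bps


# Hypothetical samples: taken 300 s apart on a 1 Gbit/s interface.
print(utilization_percent(1_000_000, 3_001_000_000, 300, 1_000_000_000))
```

On high-speed links a 32-bit counter can wrap more than once per polling interval, which this calculation cannot detect. The 64-bit IF-MIB::ifHCInOctets and IF-MIB::ifHCOutOctets counters avoid the problem; pass counter_bits=64 when using them.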
4.4.5 Network Performance Monitoring System

The network performance monitoring system for LSST shall provide a collection of tools to monitor and measure each network segment that crosses multiple network domains. Tools shall be able to gather metrics for both passive and active measurements. Active measurements include the following metrics: Achievable Bandwidth, One-way Delay, Layer 3 Path, Round Trip Delay. Passive measurements include the following metrics: Layer 1 and Layer 2 statistics (e.g., SNMP); Flow Observation (e.g., Netflow, sFlow).

4.4.6 LSST NET Website

A representation of the LSST network and its segments (described in section 3 and Figure 1) shall be on the LSST Network website. This web site shall display real-time status information, such as alerts. A weather map is a possible representation of the LSST network and its segments. The LSST NET and each of the participating NOCs shall display and monitor this information on the LSST NET web site.

5 Monitoring, Network Alert Notifications and Outage Management

The LSST NOC shall provide proactive monitoring of all elements of the LSST network, and shall generate service alert alarms for a variety of network-centric services, including but not limited to:
o Interface up/down status
o Discards/Errors counters
o Interface utilization
o Non-Unicast Packet rate
o Network device CPU utilization
o IGP adjacency status (IS-IS, OSPF)
o BGP neighbor status
o BGP accepted/received prefixes
o Management reachability
o NTP status
o Laser Received Power Level

These and other alert parameters may be monitored using standard SNMP-based tools, such as Nagios, Cacti, Zabbix, Tivoli Netview, CA, etc. One or more of these web-based network monitoring tools that support SNMP shall be used to provide the LSST NOC with proactive monitoring functions and an overall view of the network, including a weathermap of the network. A more complete description of each parameter may be found in the Appendix.
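As an illustration of how polled values might be turned into alert notifications, the sketch below checks one device sample against a few of the parameters listed above. The layout of the sample dictionary, the function name, the interface names, and the threshold constants are assumptions made for this example; a deployed monitoring tool would supply its own data model and configured thresholds.

```python
UTILIZATION_THRESHOLD = 80.0  # percent; assumed value for this example
CPU_THRESHOLD = 80.0          # percent; assumed value for this example


def evaluate_sample(sample):
    """Return a list of alert strings for one polled device sample."""
    alerts = []
    for name, iface in sample["interfaces"].items():
        if iface["oper_status"] != "up":
            alerts.append(f"{name}: interface DOWN")
        # Error and discard values are deltas since the previous poll.
        if iface["errors"] > 0 or iface["discards"] > 0:
            alerts.append(f"{name}: discards/errors detected")
        if iface["utilization"] >= UTILIZATION_THRESHOLD:
            alerts.append(f"{name}: utilization {iface['utilization']:.0f}%")
    if sample["cpu"] >= CPU_THRESHOLD:
        alerts.append(f"device CPU {sample['cpu']:.0f}%")
    return alerts


# Hypothetical sample for a router with two interfaces.
sample = {
    "cpu": 35.0,
    "interfaces": {
        "xe-0/0/0": {"oper_status": "up", "errors": 0, "discards": 0,
                     "utilization": 91.0},
        "xe-0/0/1": {"oper_status": "down", "errors": 0, "discards": 0,
                     "utilization": 0.0},
    },
}
for alert in evaluate_sample(sample):
    print(alert)
```

Keeping the rules in one function makes it easy for all participating NOCs to review which conditions raise an alert; the notification transport (e-mail, ticket, dashboard) is deliberately left out of the sketch.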
5.1 Real-time Network Monitoring system

A real-time network monitoring system shall consist of the NOC System Tools and Components described in section 4.4. Monitoring tools shall be configured to monitor for events that indicate a possible outage, and then to send alert notifications to the parties responsible for responding to those events. The tools listed in section 4.4 shall provide reports to show that the network links are operational and performing as required.

5.2 Monitoring tools

A suite of tools shall be established for LSST Network Operations to monitor and test the health of the network. Such tools may be based on current tools that support inter-domain performance monitoring, such as perfSONAR. Appendix 11.2 contains a recommendation for the use of perfSONAR to perform at least two very important tests: (1) end-to-end bandwidth; and (2) network delay.

5.3 Router Proxy

A Router Proxy provides a web-based interface that allows users to query a router without having to be directly connected to the device. A router proxy is commonly used for troubleshooting, for example to check IP route advertisements and BGP paths, or to run traceroute, ping, and other commands the router is permitted to execute via the proxy. Output from the router is then displayed in the web interface. There are different implementations of router proxies. Router proxies may be implemented with public access, or as private services that limit access to directly connected users. Typically, it is the NOC administrator who decides which commands may be executed in a router proxy. For example, NOC administrators may limit execution to a single command, such as a traceroute.
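The command restriction described above can be sketched as an allowlist check performed before anything is sent to the router. The allowed command set, the function name, and the argument format below are hypothetical; this is an illustration of the idea, not the implementation used by any router proxy mentioned in this plan.

```python
import ipaddress

# Hypothetical allowlist chosen by the NOC administrator.
ALLOWED_COMMANDS = {"ping", "traceroute"}


def build_proxy_command(command, target):
    """Validate a user request and return the argument list to execute.

    Raises ValueError if the command is not allowlisted or the target
    is not a literal IPv4/IPv6 address.
    """
    if command not in ALLOWED_COMMANDS:
        raise ValueError(f"command not permitted: {command!r}")
    # ip_address() rejects anything that is not a literal address, so
    # extra options or shell metacharacters cannot ride along in target.
    address = ipaddress.ip_address(target)
    return [command, str(address)]


print(build_proxy_command("traceroute", "192.0.2.10"))
```

Returning an argument list rather than a single string means the caller never has to pass user input through a shell.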
Router proxies currently operating are the following:
• AMPATH: http://routerproxy.grnoc.iu.edu/ampath/
• Internet2: provides router proxy services for its IP network at http://routerproxy.grnoc.iu.edu/internet2/, and for its AL2S 100G network at http://routerproxy.grnoc.iu.edu/al2s/
• ESNET Router Server: http://traceroute.es.net/cgi-bin/trace
• REUNA has a private router proxy service that requires credentials to access. REUNA has offered to include the AURA border router within the scope of its router proxy. This will permit the AURA border router to be queried.

5.4 Outage Management

Outages are events that result in loss of services. The monitoring tools and alert notifications described previously are used to detect outages and the events that caused them. Events that result in outages are normally fiber cable cuts, failures of active equipment, operator errors, etc. Monitoring tools track events. Events signal possible outages.

Outage Management functions:
• Localization of the outage.
• Extent of the outage: determination, dissemination of information.
• Restoration efforts: information gathering, dissemination, coordination.

Participating NOCs and engineering teams will use an internal trouble ticket system to track network issues and coordinate among themselves. Information shall be disseminated to upstream providers or downstream organizations affected by the outage via e-mail or phone calls, and updates shall be provided to network engineers as reasonably needed.

Outage management/monitoring tools must report alert notifications to guide the LSST:NET Team in the resolution of outages and restoration of services. When an alert notification occurs, the LSST:NET shall perform the procedure that corresponds to the alert. Alarms should be seen by all NET members. The NOC responsible for handling the event immediately starts executing a corrective action procedure.
Table 2 below provides a non-exhaustive list of alert notification messages and a description of the corresponding procedure.

Table 2 Alert Notification with corresponding Procedure

Alert Notification Message | Procedure to be executed
Link DOWN/UP | Identify the segment; responsible NOC should open a ticket with the responsible carrier or colocation facility.
BGP Session DOWN/UP | Identify the BGP router; responsible NOCs should identify the cause and restore the session.
High CPU Utilization | Identify the router; responsible NOC should identify the cause and solve the problem.
Date/Time Not Synchronized | Identify the NTP server; responsible NOC should identify the cause and fix the synchronization.
BGP Received Prefixes has changed over 20-50% | Identify the BGP router; responsible NOCs should work together to understand what happened and decide whether an action should be taken.
Discards/Errors detected on Interface | Identify the router; responsible NOC should contact the carrier and/or the colocation facility.
Delay/RTT has changed more than 20-50% | Identify which segment had its delay increased; responsible NOC should call the carrier.
Received Power Laser Level is too low (<15 dBm) | Identify the port and the router; call the colocation facility responsible for the cross-connection. It may also be necessary to call the carrier.
A probe or a perfSONAR server is not responding | Identify the device; responsible NOC should work to fix the problem.

6 Reporting

Monthly, quarterly, and annual reports shall be generated routinely to reflect outage and maintenance activity, network availability statistics, and general trouble ticket analysis. Report generation shall be performed by the LSST Lead Network Engineer. Reports shall be made available on the LSST NET website under “Support” or “Reports”. These reports shall also be used for a general overview of the LSST network operations and engineering team during regularly scheduled NOC Operations calls.
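The network availability statistics mentioned above could be derived from outage records kept in the trouble ticket system. The sketch below assumes each outage is recorded as a (start, end) pair of datetimes for a single segment; the function name and the sample outages are hypothetical.

```python
from datetime import datetime, timedelta


def availability_percent(period_start, period_end, outages):
    """Percentage of the reporting period not covered by any outage."""
    total = (period_end - period_start).total_seconds()
    if total <= 0:
        raise ValueError("period_end must be after period_start")
    # Clip outages to the reporting period and drop those outside it.
    clipped = sorted(
        (max(start, period_start), min(end, period_end))
        for start, end in outages
        if start < period_end and end > period_start
    )
    downtime = timedelta(0)
    covered_until = period_start
    for start, end in clipped:
        # Count overlapping outages only once.
        start = max(start, covered_until)
        if end > start:
            downtime += end - start
            covered_until = end
    return 100.0 * (1 - downtime.total_seconds() / total)


# Hypothetical monthly report: two overlapping outages on 10 June 2013.
june = (datetime(2013, 6, 1), datetime(2013, 7, 1))
outages = [
    (datetime(2013, 6, 10, 2, 0), datetime(2013, 6, 10, 5, 0)),
    (datetime(2013, 6, 10, 4, 0), datetime(2013, 6, 10, 8, 0)),
]
print(round(availability_percent(*june, outages), 3))
```

Merging overlapping records matters when several NOCs open tickets for the same event; without it, the same downtime would be counted more than once.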
7 Performance and Tuning

A performance monitoring system will be deployed to measure and verify the performance of the end-to-end link from La Serena to NCSA. In particular, link integrity is an important test when data corruption is reported or suspected. The NET shall provide a procedure for users to test the performance of the link and to verify that the network is performing as expected and is not the cause of a data integrity problem.

7.1 Data Integrity Mechanisms

Checksums on all data transfers shall be checked at the application layer; this is therefore outside the scope of this document.

7.2 Link Integrity Mechanisms

Active monitoring procedures should be performed by the network monitoring tools to test link integrity. Section 5 defines specific items that should be monitored to provide information about end-to-end link quality from the Layer 1 to Layer 3 perspectives. For large data movements, however, it is sometimes important to simulate a data transfer to validate the circuit capacity and the end-to-end link integrity from the application point of view. One approach is to implement an active monitoring procedure that generates a large data set and its hash, and then sends the data to the remote end host. The remote host computes a hash of the received data for comparison. If the two hashes match, the data arrived intact and the link is sound. It is important to generate a data set large enough to use the whole bandwidth for more than one minute, to force full utilization of the link. perfSONAR could be used to accomplish this test. In case of poor performance, perfSONAR will help the LSST:NET to isolate the problem, as described in section 7.3.

7.3 Network/Link Performance Issues

If the link has a physical cut in some segment, it is normally straightforward to locate the problem.
However, if the link is not cut but its performance is degraded, identifying where the problem lies is more complicated. Detecting degraded link performance requires tools that gather physical indicators of the links, so that the location of a failure along the path can be identified in the least time possible. Some of these indicators are:
• Status of circuits
• Latency Charts
• Segment by segment analysis

Poor performance may also be a result of the following factors: (a) client and server end hosts are not properly tuned; (b) buffer size issues in hosts and network equipment; (c) misconfiguration of network switches or routers; (d) firewall performance issues; (e) poor choice of file transfer tool. LSST:NET shall adopt best practices and establish a common set of tools to conduct network performance tests. One suggestion is to deploy perfSONAR servers at each party so that segment-by-segment tests can be performed. With this approach, it would be easy to detect where on the network the poor performance occurs. If test results show that the network performance is within specified requirements, yet the application is not performing as expected, then the LSST Lead Network Engineer will liaise with the applications team’s liaison to further explore the issue and to coordinate resources towards resolution.

8 Maintenance (per Segment)

Maintenance of the network segments must be scheduled and users must be notified with ample advance notice; for example, 48 hours. The LSST:NET shall announce a periodic maintenance window to perform routine maintenance on the network.

8.1 Maintenance Notification and Operations Calendar

LSST NET shall provide scheduled maintenance notifications to each participating NOC.
The notification method will be either (a) a broadcast mailing list, composed of email addresses provided by LSST:NET, covering any segments involved in a scheduled maintenance window; or (b) targeted notification of individual entities about outages or maintenance affecting only their network connectivity. Scheduled maintenance is to be further tracked via the ticketing system previously described and monitored by the LSST NOC Service Desk staff. Maintenance may also be subject to Change Management approvals. The LSST NOC shall maintain Operations Calendars available via the Web. These calendars include Scheduled and Unscheduled Outages as well as Service Requests.

9 Installation and Integration

LSST will purchase services from the participating NOCs for the installation and maintenance of the necessary equipment.

10 Management

10.1 Network Engineering Team

Engineers would be located as follows:
• 1 Tucson (Lead Network Engineer, also on Network Architecture Team)
• 1/2 NCSA (also on Network Architecture Team)
• 1/2 AMPATH
• 1/2 REUNA
• 1 La Serena

10.2 Escalation List and Outage Notification Procedures

During a service outage, an escalation procedure is executed to notify people on the escalation list. The escalation procedure should contain information to identify the severity level and what action to take. The escalation procedure should identify the responsible NOC for each segment. The LSST NET shall define the escalation procedure and escalation list for each segment of the network. The LSST NET shall keep this list current and accessible via a secure wiki/website.

10.2.1 AMPATH Escalation List

AMPATH NOC (hosted at Indiana University)
AMPATH Network Engineer
LSST Network Lead Engineer
AMPATH Director

10.2.2 REUNA Escalation List
REUNA NOC
REUNA Network Engineer
LSST Network Lead Engineer
REUNA Director

10.2.3 NCSA Escalation List
NCSA Network Engineer
LSST Network Lead Engineer
NCSA Director

11 Appendix

11.1 Parameters for Network Alert Notifications
The following is a list of well-known parameters in the SNMP management information base. These parameters will be used to represent the status of the network to operators, network engineers, and users through a common standards-based tool.

o Interface up/down status
   • All interfaces of the LSST network should be monitored using the SNMP object “IF-MIB::ifOperStatus”, which reports the interface operational status (UP or DOWN). This object might also be used to measure the overall availability of the network.
   • Each time the SNMP query returns a DOWN value, an alert must be sent to the LSST:NET. It is recommended to wait at least 30-60 seconds before contacting the network carrier, to be sure it is not just a flap. Multiple flaps must not be observed.
o Discards/Errors counters
   • One factor that affects network performance is the number of errors and/or discards occurring in the path. Errors might be caused by CRC failures, duplex negotiation issues, or invalid packet lengths; discards could reflect an over-utilized interface that is dropping packets because it is receiving more traffic than it supports.
   • It is important to measure these interface counters, and the SNMP objects “IF-MIB::ifInErrors” and “IF-MIB::ifOutErrors” are the best way to do it.
   • These counters must always be zero.
o Interface utilization
   • The bandwidth utilization of each interface should be measured at least every 5 minutes, since bursts can happen and compromise the performance of all applications.
   • It is possible to measure this utilization using the SNMP objects “IF-MIB::ifInOctets” (incoming traffic) and “IF-MIB::ifOutOctets” (outbound traffic).
   • Each time the threshold of 80% is reached, an alert must be sent to the LSST:NET.
o Non-unicast packet rate
   • Broadcast storms are a very common and dangerous problem for Layer 2-based networks. It is important to define a non-unicast packet rate limit to filter at all edge interfaces, but even this control might not be enough. It is important to monitor this packet rate in order to avoid and understand issues such as high CPU utilization and poor performance.
   • It is possible to measure this rate using the SNMP object “IF-MIB::ifInNUcastPkts”.
   • Because the SNMP object for non-unicast packets includes multicast packets, the network profile must be understood before implementing any alert. Note that this counter is per interface, not per VLAN. To count VLAN non-unicast traffic, a different approach should be deployed.
o Network device CPU utilization
   • The network device CPU is used for control plane activities, such as routing and management. Broadcast storms, DoS attacks, and interface flaps could increase CPU utilization and cause network adjacencies to go down, making the network unstable. It is therefore important to monitor this utilization.
   • Network device CPU utilization is not part of the standard IF-MIB, so each network administrator will have to find the relevant MIB object for their network device. For example, on Juniper devices it is possible to measure CPU utilization with the MIB object “jnxOperatingCPU”. For Cisco, it is possible to use the CISCO-PROCESS-MIB. All vendors make their MIB documentation available on their web sites.
   • Each time the threshold of 80% is reached for more than 30 seconds, an alert must be sent to the LSST:NET.
o IGP adjacency status (IS-IS, OSPF)
   • Every time an IGP adjacency is lost, an alert must be sent to the LSST:NET.
o BGP neighbor status
   • As the LSST network will depend on Layer 3 connectivity to reach Internet2 and other NRENs, BGP will have an important role in the LSST project.
     All BGP sessions must therefore be monitored.
   • The BGP neighbor status is also vendor-specific; it is not part of the standard IF-MIB. All major network vendors provide a MIB to monitor the BGP neighbor status. Each network administrator will have to find the relevant MIB object for their network device.
   • Each time the SNMP query returns a DOWN value, an alert must be sent to the LSST:NET.
o BGP accepted/received prefixes
   • Once the BGP session is established, some prefixes are advertised to the BGP neighbor and some prefixes are received from it. When BGP policies are changed, prefixes could be added to or removed from the BGP tables, creating network instability. For this reason, it is important to monitor the number of received and advertised prefixes.
   • The objects for these counters are part of the same MIB used to monitor the BGP neighbor status. Vendor documentation must be checked.
o Management reachability
   • It is possible to monitor management reachability using simple ICMP Echo tests.
o NTP status
   • In a network where one-way delay is important, all network devices must have NTP servers configured so that the system clock is accurate.
   • On the servers connected to an external time source (GPS or CDMA), which are Stratum 1 servers because the reference clock itself is Stratum 0, it is possible to monitor the drift using NMS agents or an SNMP MIB.
   • All network devices synchronized to these servers will become Stratum 2. To monitor these devices, it is also possible to use SNMP, but the MIB is vendor-specific. Major vendors support this monitoring, and their documentation must be checked.
   • Any time an NTP server is not synchronized, an alert must be generated to the LSST:NET.

11.2 Measurement Instrumentation
Measurement instrumentation shall be used in each network segment. perfSonar (footnote 3) is a network performance monitoring system used mostly by academic networks to identify end-to-end performance problems on paths crossing several networks.
It provides a measurement framework to which other tools may be added to increase its functionality. As it supports several different measurements, it is recommended that bandwidth and delay tests be configured on different nodes, since bandwidth tests could interfere with the delay test results. Taking this recommendation into consideration, two approaches might be used:
1) Two perfSonar nodes: one for bandwidth and end-to-end tests and one for delay (RTT and one-way) tests. With this approach, it is recommended that the second perfSonar node have an external clock source, such as a GPS or CDMA device. This would allow the NOCs to obtain very accurate one-way delay measurements.
2) One perfSonar node and one delay node. With this approach, the perfSonar node would be used for bandwidth and end-to-end tests, while the second node would be responsible for the delay (RTT) tests. This second node could be any device that supports ICMP and traceroute tests. The ATLAS RIPE probe is suggested for the delay tests.
The first option has the advantage of supporting one-way delay, but it requires two servers, more rack space for the second server, and access to the facility's roof to install the GPS device, which may be impracticable. The second option is less expensive, since only a single server is required. The ATLAS RIPE probe has shown itself to be a very reliable solution. It does not support one-way delay, but this measurement is not a requirement for this plan. AmLight has been testing the ATLAS RIPE probe and recommends this approach. In the future, the solution should be re-evaluated to decide whether it needs to be replaced or complemented.
To be effective, as mentioned before, all networks in the path will need to support bandwidth and delay tests. Below is the current status of this deployment for each network:
1) AURA/Cerro Pachon: It will need a perfSonar node. Hardware is available.
   Installation of perfSonar is required. An ATLAS probe is to be installed.
2) AURA/La Serena: perfSonar node and ATLAS probe already operational.
3) AmLight/Santiago: Neither a perfSonar node nor an ATLAS probe is present in the Level3 facilities in Santiago (where the international links are received).
4) REUNA/Santiago: perfSonar node and ATLAS probe already operational.
5) ANSP/Sao Paulo: perfSonar hardware and ATLAS probe already operational.
6) AMPATH/Miami: perfSonar node and ATLAS probe already operational.
7) FLR: No information about their deployment.
8) Internet2: Internet2 has perfSonar servers available in all of its points of presence.
9) Starlight: No information about their deployment.
10) NCSA: There is a perfSonar node in Urbana-Champaign.
Below is the proposed topology for this measurement (pS = perfSonar, A = ATLAS probe). This figure represents the recommendation described in this plan to use perfSonar and ATLAS probes in each of the networks participating in the LSST Network Operations and Management Plan.
Figure 2 Topology of perfSonar nodes and ATLAS probes

Footnote 3: http://www.perfsonar.net/start.html
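As an illustration of the interface utilization alert described in Section 11.1, the following is a minimal sketch, in Python, of how utilization could be derived from two successive samples of an octet counter and compared against the 80% threshold. The function names and the way samples are obtained are assumptions made for illustration only; the plan does not prescribe an implementation, and SNMP polling itself is not shown. The sketch assumes that the interface speed is known in bits per second and that at most one counter wrap occurs between samples. Since “IF-MIB::ifInOctets” and “IF-MIB::ifOutOctets” are 32-bit counters, they can wrap several times within a 5-minute interval on high-speed links; in that case the 64-bit “IF-MIB::ifHCInOctets” and “IF-MIB::ifHCOutOctets” counters would be the appropriate source, and the modulus parameter is provided for that purpose.

```python
# Hypothetical helper names; nothing here is prescribed by the plan.

COUNTER32_MODULUS = 2 ** 32  # ifInOctets / ifOutOctets (32-bit counters)
COUNTER64_MODULUS = 2 ** 64  # ifHCInOctets / ifHCOutOctets (64-bit counters)


def octet_delta(prev: int, curr: int, modulus: int = COUNTER32_MODULUS) -> int:
    """Octets counted between two samples, allowing for one counter wrap."""
    return (curr - prev) % modulus


def utilization_percent(
    prev_octets: int,
    curr_octets: int,
    interval_s: float,
    if_speed_bps: float,
    modulus: int = COUNTER32_MODULUS,
) -> float:
    """Average utilization over the polling interval, as a percentage."""
    if interval_s <= 0 or if_speed_bps <= 0:
        raise ValueError("interval and interface speed must be positive")
    bits = octet_delta(prev_octets, curr_octets, modulus) * 8
    return 100.0 * bits / (interval_s * if_speed_bps)


def needs_alert(util_percent: float, threshold: float = 80.0) -> bool:
    """True when utilization has reached the alert threshold (80% in 11.1)."""
    return util_percent >= threshold
```

For example, with a 300-second polling interval on a 1 Gb/s interface, a counter increase of 30,000,000,000 octets corresponds to 240,000,000,000 bits out of a possible 300,000,000,000, which is exactly the 80% alert threshold. Inbound and outbound counters would each be evaluated separately.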
