Control and Data Networks Architecture Proposal
S. Stancu, C. Meirosu, B. Martin
31 May, 2006

Outline
• Overview of the TDAQ system and networks
• Technology and equipment
• TDAQ networks:
  - Control network - no special bandwidth requirement
  - Dedicated data networks:
      DataFlow network - high bandwidth (~100 Gbit/s cross-sectional bw.) and minimal loss
      BackEnd network - high bandwidth (~50 Gbit/s cross-sectional bw.)
  - Switch management network
• Conclusions

ATLAS TDAQ System/Networks
[Diagram: overview of the TDAQ networks. Control Network (network root to ATCN, Online racks, management servers), DataFlow Network (ROSs, SVs, L2PUs, SFIs), BackEnd Network (SFIs, EFPs, SFOs, mass storage) and the switch Management Network; control, data and management links are drawn separately. High availability required: 24/7 operation whenever the accelerator is running.]

Technology and equipment
• Ethernet is the dominant technology for LANs
  - TDAQ's choice for its networks (see [1]): multi-vendor, long-term support, commodity hardware (on-board GE adapters), etc.
  - Gigabit and TenGigabit Ethernet: GE for end-nodes, 10GE whenever the bandwidth requirements exceed 1 Gbit/s.
• Multi-vendor Ethernet switches/routers are available on the market:
  - Chassis-based devices (~320 Gbit/s switching)
      GE line-cards: typically ~40 ports (1000BaseT)
      10GE line-cards: typically 4 ports (10GBaseSR)
  - Pizza-box devices (~60 Gbit/s switching)
      24/48 GE ports (1000BaseT)
      Optional 10GE module with 2 up-links (10GBaseSR)

Architecture and management principles
• The TDAQ networks will typically employ a multi-layer architecture:
  - Aggregation layer. The computers installed in the TDAQ system at Point 1 are grouped in racks, so it is natural to use a rack-level concentrator switch wherever the bandwidth constraints allow it.
  - Core layer. The network core is composed of chassis-based devices, receiving GE or 10GE up-links from the concentrator switches and direct GE connections from the applications with high bandwidth requirements.
• The performance of the networks themselves will be continuously monitored for availability, status, traffic loss and other errors.
  - In-band management: the "in-band" approach allows intelligent network monitoring programs not only to indicate the exact place of a failure (root-cause analysis), but also to show the repercussions of that failure on the rest of the network.
  - Out-of-band management: a dedicated switch management network (shown in blue in Figure 1) provides the communication between the management servers (performing management and monitoring functions) and the "out-of-band" interface of all network devices in the control and data networks.

Resilient Ethernet networks
• What happens if a switch or link fails? A phone call, but nothing critical should happen after a single failure.
• Networks are made resilient by introducing redundancy:
  - Component-level redundancy: deployment of devices with built-in redundancy (PSU, supervision modules, switching fabric).
  - Network-level redundancy: deployment of additional devices/links in order to provide alternate paths between communicating nodes.
• Protocols are needed to correctly (and efficiently) deal with multiple paths in the network [2] (see the sketch below):
  - Layer 2 protocols: link aggregation (trunking), spanning trees (STP, RSTP, MSTP)
  - Layer 3 protocols: virtual router redundancy (VRRP) for static environments, dynamic routing protocols (e.g. RIP, OSPF).
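The Layer 2 mechanisms listed above all address the same problem: keep redundant links available without creating forwarding loops. The following is a minimal, purely illustrative sketch in plain Python (hypothetical topology and names, not an STP/RSTP/MSTP implementation) of the underlying idea: compute a loop-free tree over a redundant topology, keep the remaining links blocked, and recompute the tree when a link fails.

```python
# Minimal illustration (not a protocol implementation): given a redundant
# Layer 2 topology, a spanning tree keeps one loop-free set of links active
# and logically blocks the rest, re-activating them only after a failure.
# Topology and device names are hypothetical.
from collections import deque

links = [  # (switch, switch) pairs: two cores plus dual-homed rack switches
    ("core1", "core2"),
    ("core1", "rack1"), ("core2", "rack1"),
    ("core1", "rack2"), ("core2", "rack2"),
]

def spanning_tree(links, root="core1"):
    """Return the subset of links kept active by a BFS tree rooted at `root`."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    visited, active, queue = {root}, [], deque([root])
    while queue:
        node = queue.popleft()
        for peer in adj[node]:
            if peer not in visited:
                visited.add(peer)
                active.append((node, peer))
                queue.append(peer)
    return active

active = spanning_tree(links)
blocked = [l for l in links if l not in active and tuple(reversed(l)) not in active]
print("active:", active)    # loop-free tree used for forwarding
print("blocked:", blocked)  # redundant links held in reserve

# Simulate a link failure: remove an active link and recompute the tree.
links.remove(("core1", "rack1"))
print("after failure:", spanning_tree(links))  # rack1 now reached via core2
```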
Control network
• ~3000 end-nodes. Design assumptions:
  - No end-node will generate control traffic in excess of the capacity of a GE line.
  - The aggregated control traffic external to a rack of PCs (with the exception of the Online Services rack) will not exceed the capacity of a GE line.
• The network core is implemented with two chassis devices redundantly interconnected by two 10GE lines. A rack-level concentrator switch can be deployed for all units except for critical services.
• Layer 3 routed network:
  - Static routing at the core should suffice.
  - One sub-net per concentrator switch: small broadcast domains, so potential Layer 2 problems remain local.
• Protocols / fault tolerance (resiliency):
  - VRRP in case of router/up-link failures.
  - Interface bonding for the infrastructure servers.
[Diagram: control network. Core 1 and Core 2 redundantly interconnected by 10GE control links, with an up-link towards the network root (to ATCN); GE control links to rack-level concentrator switches for 4 Online racks, 17 LVL2 racks, 1 DC Control rack, 4 SFI racks, 57 EF racks and 3 SFO racks; management servers, the DataFlow and BackEnd networks and the SFO switch also attach to the control network.]

DataFlow network
[Diagram: DataFlow network. ~160 ROSs behind ~20 ROS concentrator switches, two central switches (Central 1, Central 2) carrying VLANs A and B, 17 L2PU racks (~550 L2PUs) behind L2PU concentrator switches, SVs and ~100 SFIs; GE data links at the edge, 10GE data links towards the core, GE control links to the Control Network; aggregate cross-sectional bandwidth ~100 Gbit/s.]
• Two central switches:
  - One VLAN per switch (A and B).
  - Fault tolerant (tolerates one central switch failure).
• 10G ROS "concentration":
  - Motivated by the need to use fibre transmission (distance > 100 m).
  - Full bandwidth provided by aggregating 10x GE into 1x 10GE (see the sketch below).
  - Requires the use of VLANs (and MST) to maintain a loop-free topology.
• 10G L2PU concentration:
  - One switch per rack.
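To make the aggregation figures concrete, here is a back-of-the-envelope sketch in plain Python. The numbers are taken from the slides above; the helper name is illustrative. It checks that aggregating 10x GE into a 10GE up-link introduces no oversubscription, and that ~100 SFIs, assuming each attaches with a single GE data link as in the diagrams, is roughly consistent with the ~100 Gbit/s cross-sectional bandwidth figure.

```python
# Back-of-the-envelope check of the DataFlow aggregation figures.
# Numbers come from the slides; the function name is illustrative only.

GE = 1.0       # Gbit/s of a Gigabit Ethernet link
TEN_GE = 10.0  # Gbit/s of a 10-Gigabit Ethernet link

def oversubscription(n_edge_links, uplink_gbps, edge_gbps=GE):
    """Offered edge bandwidth divided by up-link capacity (1.0 = full bandwidth)."""
    return n_edge_links * edge_gbps / uplink_gbps

# ROS concentration: 10 GE-attached ROS PCs aggregated into one 10GE up-link,
# i.e. full bandwidth with no oversubscription.
assert oversubscription(10, TEN_GE) == 1.0

# Event-building side: ~100 SFIs, each assumed to attach with one GE data
# link, can sink at most ~100 Gbit/s in aggregate -- roughly consistent with
# the ~100 Gbit/s cross-sectional bandwidth quoted for the DataFlow network.
n_sfi = 100
print(n_sfi * GE)  # 100.0 Gbit/s
```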
BackEnd network
• ~2000 end-nodes.
• One core device with built-in redundancy (eventually two devices).
• Rack-level concentration, with link aggregation used for redundant up-links to the core.
• Layer 3 routed network to restrict the size of the broadcast domains.
[Diagram: BackEnd network. ~100 SFIs and 57 EF racks behind EF concentrator switches feed the core (Central 1); ~30 SFOs behind an SFO concentrator deliver data to mass storage; GE data links at the edge, 10GE data links towards the core, GE control links to the Control Network; annotated aggregate bandwidths of ~50 Gbit/s (SFI side) and ~2.5 Gbit/s (SFO side).]

Interchangeable processing power
• Standard processor rack (XPU rack) with up-links to both the DataFlow and the BackEnd networks.
• The processing power migration between L2 and EF is achieved by software enabling/disabling of the appropriate up-links (see the sketch below).
[Diagram: XPU rack. The processing units (PUs) connect through a data concentrator and a control concentrator; the data concentrator up-links to both the DataFlow network (SVs, SFIs, L2PUs) and the BackEnd network (EFPs).]
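A minimal sketch of the up-link enabling/disabling idea, using a hypothetical XPURack abstraction. In practice the migration would be carried out through the switch and host configuration, not a Python object; the point is simply that the same rack serves L2 when its DataFlow up-link is enabled and EF when its BackEnd up-link is enabled.

```python
# Sketch of "interchangeable processing power": an XPU rack keeps up-links to
# both the DataFlow and the BackEnd network, and moving the rack between L2
# and EF duty amounts to choosing which up-link is enabled.
# Class and attribute names are hypothetical.

class XPURack:
    def __init__(self, name):
        self.name = name
        self.uplinks = {"dataflow": False, "backend": False}

    def assign(self, role):
        """Enable the up-link matching the role ('L2' or 'EF'), disable the other."""
        if role == "L2":
            self.uplinks.update(dataflow=True, backend=False)
        elif role == "EF":
            self.uplinks.update(dataflow=False, backend=True)
        else:
            raise ValueError("unknown role: %s" % role)

rack = XPURack("xpu-rack-01")
rack.assign("L2")
print(rack.uplinks)   # {'dataflow': True, 'backend': False}
rack.assign("EF")     # shift the same processing power to the Event Filter
print(rack.uplinks)   # {'dataflow': False, 'backend': True}
```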
Switch Management Network
• It is good practice not to mix management traffic with normal traffic: if the normal traffic shows abnormal patterns it may overload some links and potentially starve in-band management traffic.
• The switch management network gives the management servers access to the out-of-band interface of all the devices in the control and data networks.
• Flat Layer 2 Ethernet network with no redundancy.
[Diagram: switch management network. A management root switch (24 ports) connects 24/48-port management switches in SDX Level 1/2 and USA15 Level 1/2, which in turn reach the control cores, the DataFlow and BackEnd cores, the control and data rack concentrators and the ROS data concentrators over GE (copper and fibre) management links.]

Conclusions
• The ATLAS TDAQ system (approx. 3000 end-nodes) relies on networks for both control and data acquisition purposes.
• Ethernet technology (+ IP).
• The network architecture maps onto multi-vendor devices.
• Modular network design.
• Resilient network design (high availability).
• Separate management path.

Back-up slides

Option A
• 1GE links
• Central switches in SDX
[Diagram: the ROSs in USA15 connect directly over GE fibre to the two central switches (Central 1, Central 2) located in SDX; SVs, SFIs and L2PU concentrators attach in SDX. Legend: 1000BaseT (GE copper), 1000BaseSX (GE fibre), 10GBaseSR (10GE fibre).]

Option B
• 10GE links
• Central switches in SDX
[Diagram: the ROSs in USA15 are aggregated by ROS concentrators with 10GE fibre up-links to the two central switches in SDX; same legend as Option A.]

Option C
• 10GE links
• Central switches in USA15
[Diagram: the ROSs connect directly to the two central switches located in USA15; the SV, SFI and L2PU concentrators in SDX reach the central switches over 10GE fibre up-links; same legend as Option A.]

20 ROSs trunking
[Diagram: two groups of 10 dual-homed ROS PCs (10x GE towards side A, 10x GE towards side B) aggregated by two ROS concentrators; each concentrator reaches Central 1 and Central 2 over trunked (link-aggregated) up-links; SVs, L2PU concentrators and SFIs attach to both central switches. GE and 10GE data links as in the legend.]

20 ROSs VLANs
[Diagram: the same two ROS concentrators, but each up-link is assigned to one VLAN, with the other as backup (A (B) / B (A)), instead of being trunked; Central 1 and Central 2 carry both VLANs.]
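The two back-up variants above differ in how the redundant up-links are used: with trunking, the link-aggregation group hashes each conversation onto one member link, whereas the VLAN variant statically pins each up-link to VLAN A or B. The sketch below, in plain Python, shows the per-flow nature of trunk load sharing; the hash, flow identifiers and link names are illustrative stand-ins, not any vendor's algorithm.

```python
# Illustration of flow distribution over a 2-link trunk (link aggregation).
# Real switches hash on MAC/IP/port fields with vendor-specific functions;
# CRC32 below is only a stand-in to show the per-flow, not per-packet, split.
from collections import Counter
import zlib

uplinks = ["10GE-to-Central1", "10GE-to-Central2"]

def trunk_member(flow_id, n_links=len(uplinks)):
    """Pick a trunk member for a flow; all packets of a flow stay on one link."""
    return uplinks[zlib.crc32(flow_id.encode()) % n_links]

flows = ["ROS%03d->SFI%03d" % (r, s) for r in range(10) for s in range(100)]
print(Counter(trunk_member(f) for f in flows))
# Roughly even split for many flows, but a single heavy flow cannot be split.

# The VLAN option needs no hashing: traffic tagged VLAN A always uses the
# up-link towards Central 1, VLAN B the one towards Central 2.
```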
10 ROSs single switch
[Diagram: 10 dual-homed ROS PCs aggregated by a single ROS concentrator carrying both VLANs, with one 10GE up-link per VLAN (A (B) and B (A)) towards Central 1 and Central 2; SVs, L2PU concentrators and SFIs attach to both central switches.]

10 ROSs double switch
[Diagram: the same 10 ROS PCs aggregated by two ROS concentrators, each dedicated to one VLAN (A or B) with its own 10GE up-link to the corresponding central switch.]

Sample resiliency test (see [4])
[Plot: MST resiliency test. Throughput (Gbit/s, 0-10) of VLAN1 and VLAN2 versus time (10-100 s) measured across a redundant aggregation/core topology running MST.]
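One way to read the resiliency plots above is to quantify how long the affected VLAN's throughput stays degraded while MST reconverges. The sketch below does that for a synthetic throughput trace; the numbers are invented for illustration and are not the measurements from [4].

```python
# Estimate the service interruption seen in a resiliency test from a sampled
# throughput trace. The trace below is synthetic, for illustration only.

def interruption_time(samples, period_s, threshold_fraction=0.5):
    """Total time (s) during which throughput is below a fraction of steady state."""
    steady = max(samples)
    below = sum(1 for s in samples if s < threshold_fraction * steady)
    return below * period_s

# Throughput samples in Gbit/s, one per second: a dip while MST reconverges.
vlan1 = [9.5] * 30 + [0.0, 0.0, 1.2, 8.0] + [9.5] * 30
print(interruption_time(vlan1, period_s=1.0))  # 3.0 s of degraded service
```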
Control network - Online services
[Diagram: two options for connecting the Online services CPUs - (a) a dedicated concentrator switch with a 10 Gbps up-link carrying the concentrated control traffic to the control core, or (b) 1 Gbps direct connections from each CPU to the control core.]

Control network - rack connectivity
[Diagram: rack connectivity options - (a) shared control and management: the rack's CPUs and file server reach a single concentrator carrying both control and management traffic; (b) separate control and management: distinct concentrator ports and up-links for control and management.]

Control network - rack connectivity
[Diagram: connectivity examples for DC control, L2PU and SFI racks - each rack's CPUs and file server attach to a control concentrator (GE control and management links) and to data concentrators or directly to the DataFlow/BackEnd cores (GE/10GE data links).]

Control network - rack connectivity
[Diagram: connectivity examples for SFO and EF racks - SFO racks attach to the BackEnd core and to the SFO output towards mass storage; EF racks are shown with either shared control/data rack concentrators or separate control and data concentrators.]
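All of the rack connectivity variants above rely on the control network's addressing rule of one sub-net per rack concentrator, which keeps broadcast domains small. Below is a minimal sketch of such an allocation with Python's standard ipaddress module; the 10.145.0.0/16 supernet, the /24 prefix per rack and the rack naming are illustrative assumptions, not the actual TDAQ addressing plan.

```python
# Carve one sub-net per rack concentrator out of a supernet, following the
# "one sub-net per concentrator switch" rule of the control network.
# The supernet, prefix length and rack names are purely illustrative.
import ipaddress

supernet = ipaddress.ip_network("10.145.0.0/16")
racks = (["online%d" % i for i in range(1, 5)] +       # 4 Online racks
         ["lvl2-%02d" % i for i in range(1, 18)] +      # 17 LVL2 racks
         ["ef-%02d" % i for i in range(1, 58)] +        # 57 EF racks
         ["sfi-%d" % i for i in range(1, 5)] +          # 4 SFI racks
         ["sfo-%d" % i for i in range(1, 4)] +          # 3 SFO racks
         ["dc-control"])                                # 1 DC Control rack

# A /24 per rack comfortably fits the roughly 30-40 PCs of a rack and keeps
# each broadcast domain local to one concentrator switch.
subnets = dict(zip(racks, supernet.subnets(new_prefix=24)))
print(len(racks), "racks allocated")   # 86 racks allocated
print(subnets["ef-01"])                # e.g. 10.145.21.0/24
```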