Huawei Business Continuity and Disaster Recovery Solution V100R003C10 (Geo-Redundant Mode)
Technical White Paper

Issue 01
Date 2015-12-08

HUAWEI TECHNOLOGIES CO., LTD.

Copyright © Huawei Technologies Co., Ltd. 2015. All rights reserved.
No part of this document may be reproduced or transmitted in any form or by any means without prior written consent of Huawei Technologies Co., Ltd.

Trademarks and Permissions
[Huawei logo] and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd. All other trademarks and trade names mentioned in this document are the property of their respective holders.

Notice
The purchased products, services and features are stipulated by the contract made between Huawei and the customer. All or part of the products, services and features described in this document may not be within the purchase scope or the usage scope. Unless otherwise specified in the contract, all statements, information, and recommendations in this document are provided "AS IS" without warranties, guarantees or representations of any kind, either express or implied. The information in this document is subject to change without notice. Every effort has been made in the preparation of this document to ensure accuracy of the contents, but all statements, information, and recommendations in this document do not constitute a warranty of any kind, express or implied.

Huawei Technologies Co., Ltd.
Address: Huawei Industrial Base, Bantian, Longgang, Shenzhen 518129, People's Republic of China
Website: http://e.huawei.com

Issue 01 (2015-01-30) Huawei Proprietary and Confidential Copyright © Huawei Technologies Co., Ltd.

Contents

1 Overview
1.1 Business Continuity Challenges
1.2 Solution Overview
1.3 Solution Highlights
2 Solution Architecture
2.1 Cascaded Network Architecture
2.1.1 Cascaded Network in Synchronous + Asynchronous Mode
2.1.2 Cascaded Network in Asynchronous + Asynchronous Mode
2.2 Parallel Network Architecture
2.2.1 Parallel Network in Synchronous + Asynchronous Mode
2.2.2 Parallel Network in Asynchronous + Asynchronous Mode
2.3 Active-Active Network Architecture
2.3.1 VIS Active-Active + Asynchronous Mode
2.3.2 HyperMetro + Asynchronous Remote Replication
2.4 Technology Implementation Requirements of Key Components
3 Solution Working Principles
3.1 Working Principle of the Cascaded Network in Synchronous + Asynchronous Mode
3.1.1 Initial Synchronization
3.1.2 I/O Handling Process
3.1.3 Failover
3.1.4 Failback
3.1.5 Link or DR Center Failure
3.2 Working Principle of the Parallel Network in Synchronous + Asynchronous Mode
3.2.1 Initial Synchronization
3.2.2 I/O Handling Process
3.2.3 Failover
3.2.4 Failback
3.2.5 Link or DR Center Failure
3.3 Working Principle of the Cascaded Network in Asynchronous + Asynchronous Mode
3.3.1 Initial Synchronization
3.3.2 Processing in Normal Status
3.3.3 Failover
3.3.4 Failback
3.3.5 Link or DR Center Failure
3.4 Working Principle of the Parallel Network in Asynchronous + Asynchronous Mode
3.4.1 Initial Synchronization
3.4.2 Processing in Normal Status
3.4.3 Failover
3.4.4 Failback
3.4.5 Link or DR Center Failure
3.5 Working Principle of the Network in VIS Active-Active + Asynchronous Mode
3.5.1 Initial Synchronization
3.5.2 Processing in Normal Status
3.5.3 Failover
3.5.4 Failback
3.5.5 Link or DR Center Failure
3.6 Working Principle of the Network in HyperMetro + Asynchronous Mode
3.6.1 Initial Synchronization
3.6.2 Processing in Normal Status
3.6.3 Failover
3.6.4 Failback
3.6.5 Link or DR Center Failure
3.7 Key Technical Principles of the Disaster Recovery Data Center Solution (Geo-Redundant Mode)
3.8 DR Management
4 Service Recovery Process of the Disaster Recovery Data Center Solution (Geo-Redundant Mode)
4.1 DR Test Process
4.2 Scheduled Migration Process
4.3 Failover Process
5 Summary
6 Acronyms and Abbreviations

Figures

Figure 2-1 Cascaded network architecture for the Disaster Recovery Data Center Solution (Geo-Redundant Mode)
Figure 2-2 Parallel network architecture for the Disaster Recovery Data Center Solution (Geo-Redundant Mode)
Figure 2-3 VIS Active-Active architecture for the Disaster Recovery Data Center Solution (Geo-Redundant Mode)
Figure 2-4 HyperMetro + asynchronous geo-redundant 3DC DR architecture
Figure 3-1 I/O handling process for the cascaded network in synchronous + asynchronous mode
Figure 3-2 I/O handling process for the parallel network in synchronous + asynchronous mode
Figure 3-3 Remote replication state shift
Figure 3-4 Principle on cache-based multi-timestamp replication
Figure 3-5 Dashboard for DR management
Figure 3-6 DR management and configuration wizard
Figure 3-7 DR replication topology
Figure 3-8 DR management topology
Figure 3-9 One-click disaster recovery
Figure 4-1 One-click DR test
Figure 4-2 One-click scheduled migration
Figure 4-3 One-click failover

Tables

Table 3-1 Remote replication states
1 Overview

About This Chapter
1.1 Business Continuity Challenges
1.2 Solution Overview
1.3 Solution Highlights

1.1 Business Continuity Challenges

The rapid development of IT enables information systems to play an increasingly important role in the key businesses of various industries. IT service interruption in sectors such as communications, finance, healthcare, e-commerce, logistics, and government can cause severe economic losses, brand damage, and loss of critical data. Business continuity is therefore critical to IT systems. In recent years, natural disasters affecting large areas have occurred frequently. As a result, the Disaster Recovery Data Center Solution (Geo-Redundant Mode), which combines a same-city DR center and a remote DR center, is becoming increasingly popular across industries.

1.2 Solution Overview

The solution includes one production center, one same-city DR center, and one remote DR center. Data of the production center is replicated to the same-city DR center synchronously and to the remote DR center asynchronously. The same-city DR center usually has the same service processing capabilities as the production center, so applications can be switched from the production center to the same-city DR center without data loss to ensure business continuity. When a natural disaster, such as an earthquake, affects both the production center and the same-city DR center, applications can be switched to the remote DR center. By following a procedure rehearsed in routine disaster drills, applications can resume service in the remote DR center within an acceptable time.
Remote recovery typically incurs the loss of a small amount of data.

Compared with a solution that includes only a same-city DR center or only a remote DR center, the Disaster Recovery Data Center Solution (Geo-Redundant Mode) combines the advantages of both types of DR center to address natural disasters that affect broader areas. Whether a disaster affects a small area or a large one, the DR system in this solution responds quickly to minimize data loss and achieve a better recovery point objective (RPO) and recovery time objective (RTO). This solution is therefore widely used.

1.3 Solution Highlights

The Disaster Recovery Data Center Solution (Geo-Redundant Mode) has the following highlights:

Applicability of Various Disk Array Replication Technologies
All Huawei storage products use a unified storage operating system platform, so remote replication relationships can be set up among high-end, mid-range, and entry-level disk arrays. Customers can select disk arrays for their remote DR centers based on their business requirements, enabling highly cost-effective DR systems.

Second-Level RPO for Asynchronous Replication and Minute-Level RTO
Asynchronous remote replication based on the multi-timestamp cache technology supports a replication cycle as short as 3 seconds. Huawei's DR management software, OceanStor ReplicationDirector, provides one-click DR test and failover functions that greatly simplify DR operations and reduce database recovery time to the minute level.
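As a back-of-the-envelope illustration (not a Huawei formula), the worst-case data-loss window of cycle-based asynchronous replication is roughly one replication cycle plus the time needed to transfer the pending delta, which is why a 3-second cycle can keep the RPO at the second level:

```python
def worst_case_rpo_seconds(cycle_s: float, transfer_s: float) -> float:
    """Rough worst-case data-loss window for cycle-based async replication.

    A write that lands just after a cycle's point-in-time copy is taken must
    wait up to one full cycle before it is captured, and it is only safe at
    the remote site once that delta has finished transferring.
    """
    return cycle_s + transfer_s

# With a 3-second cycle and a delta that transfers in about 2 seconds,
# the worst-case data-loss window stays in the single-digit-second range.
print(worst_case_rpo_seconds(3, 2))  # 5
```

In practice the transfer time depends on the WAN bandwidth relative to the write rate, so the cycle length alone is a lower bound on the RPO, not the whole story.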
Visualized Management of DR Services and Topologies
OceanStor ReplicationDirector uses graphics to show the physical and logical service topologies of the Disaster Recovery Data Center Solution (Geo-Redundant Mode). It supports one-click DR test and failover functions and allows customers to use customized scripts to recover DR service systems, simplifying DR system management and maintenance.

2 Solution Architecture

About This Chapter
The Disaster Recovery Data Center Solution (Geo-Redundant Mode) represents a major trend in fields including telecommunications, finance, and manufacturing. In this solution, a nearby data center (same-city DR center) is set up to achieve data protection with zero data loss, while a remote data center (remote DR center) is set up to protect data against regional disasters. The solution supports a cascaded network in synchronous + asynchronous or asynchronous + asynchronous mode (A -> B and B -> C), a parallel network in synchronous + asynchronous or asynchronous + asynchronous mode (A -> B and A -> C), and an active-active network in active-active + asynchronous mode (A <-> B and B -> C).

2.1 Cascaded Network Architecture
2.2 Parallel Network Architecture
2.3 Active-Active Network Architecture
2.4 Technology Implementation Requirements of Key Components
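The three network modes differ only in which replication pairs connect the three sites. A minimal sketch in Python (site letters follow the text; the dictionary layout is illustrative, not product configuration syntax):

```python
# Each topology is a list of (source, target, mode) replication pairs.
# A = production center, B = same-city DR center, C = remote DR center.
TOPOLOGIES = {
    # A -> B, B -> C: the same-city copy feeds the remote copy.
    "cascaded": [("A", "B", "sync"), ("B", "C", "async")],
    # A -> B, A -> C: production feeds both DR centers directly.
    "parallel": [("A", "B", "sync"), ("A", "C", "async")],
    # A <-> B, B -> C: active-active mirror plus a remote async leg.
    "active-active": [("A", "B", "active-active"), ("B", "C", "async")],
}

def replication_sources(topology: str) -> set:
    """Sites that act as a replication source in the given topology."""
    return {src for src, _dst, _mode in TOPOLOGIES[topology]}

print(sorted(replication_sources("parallel")))  # ['A']
print(sorted(replication_sources("cascaded")))  # ['A', 'B']
```

The practical consequence captured here: in the parallel mode only the production center carries replication load, while in the cascaded mode the same-city DR center must also forward data to the remote site.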
2.1 Cascaded Network Architecture

Figure 2-1 Cascaded network architecture for the Disaster Recovery Data Center Solution (Geo-Redundant Mode)

2.1.1 Cascaded Network in Synchronous + Asynchronous Mode

As shown in Figure 2-1, disk array A is deployed in the production center and disk array B is deployed in the same-city DR center. The two data centers are interconnected using Fibre Channel links. A synchronous remote replication relationship is set up between disk array A of the production center and disk array B of the same-city DR center to synchronize data from disk array A to disk array B in real time. Disk array C is deployed in the remote DR center. An asynchronous remote replication relationship is set up between disk array B of the same-city DR center and disk array C of the remote DR center to regularly synchronize data from disk array B to disk array C. The DR management software is deployed in the same-city and remote DR centers to manage the three data centers. The software shows the physical and logical service topologies of the solution and supports one-click DR tests and recovery in the same-city and remote DR centers.

2.1.2 Cascaded Network in Asynchronous + Asynchronous Mode

As shown in Figure 2-1, disk array A is deployed in the production center and disk array B is deployed in the same-city DR center. The two data centers are interconnected using Fibre Channel or IP links, depending on the bandwidth required by the data change volume. An asynchronous remote replication relationship is set up between disk array A of the production center and disk array B of the same-city DR center to regularly synchronize data from disk array A to disk array B. Disk array C is deployed in the remote DR center.
An asynchronous remote replication relationship is set up between disk array B of the same-city DR center and disk array C of the remote DR center to regularly synchronize data from disk array B to disk array C. The DR management software is deployed in the same-city and remote DR centers to manage the three data centers. The software shows the physical and logical service topologies of the solution and supports one-click DR tests and recovery in the same-city and remote DR centers.

2.2 Parallel Network Architecture

Figure 2-2 Parallel network architecture for the Disaster Recovery Data Center Solution (Geo-Redundant Mode)

2.2.1 Parallel Network in Synchronous + Asynchronous Mode

As shown in Figure 2-2, disk array A is deployed in the production center and disk array B is deployed in the same-city DR center. The two data centers are interconnected using Fibre Channel links. A synchronous remote replication relationship is set up between disk array A of the production center and disk array B of the same-city DR center to synchronize data from disk array A to disk array B in real time. Disk array C is deployed in the remote DR center. An asynchronous remote replication relationship is set up between disk array A of the production center and disk array C of the remote DR center, over IP links between the production center and the remote DR center, to regularly synchronize data from disk array A to disk array C. The DR management software is deployed in the same-city and remote DR centers to manage the three data centers. The software shows the physical and logical service topologies of the solution.
It also supports one-click DR tests and recovery in the same-city and remote DR centers.

2.2.2 Parallel Network in Asynchronous + Asynchronous Mode

As shown in Figure 2-2, disk array A is deployed in the production center and disk array B is deployed in the same-city DR center. The two data centers are interconnected using Fibre Channel or IP links, depending on the bandwidth required by the data change volume. An asynchronous remote replication relationship is set up between disk array A of the production center and disk array B of the same-city DR center to regularly synchronize data from disk array A to disk array B. Disk array C is deployed in the remote DR center. Another asynchronous remote replication relationship is set up between disk array A of the production center and disk array C of the remote DR center to regularly synchronize data from disk array A to disk array C. The DR management software is deployed in the same-city and remote DR centers to manage the three data centers. The software shows the physical and logical service topologies of the solution and supports one-click DR tests and recovery in the same-city and remote DR centers.

2.3 Active-Active Network Architecture

2.3.1 VIS Active-Active + Asynchronous Mode

Figure 2-3 VIS Active-Active architecture for the Disaster Recovery Data Center Solution (Geo-Redundant Mode)

As shown in Figure 2-3, a disk array is deployed in production center A and a VIS6600T storage virtualization gateway is deployed in production center B. A Fibre Channel network is set up between the production centers using bare fibers or wavelength division multiplexing (WDM) devices.
The Virtual Intelligent Storage (VIS) technology creates active-active mirrors for data: when an upper-layer service writes data, the data is written in real time to the disk arrays of both production centers A and B. Disk array C is deployed in the remote DR center. An asynchronous remote replication relationship is set up between disk array C and the disk array of either production center to regularly synchronize data from the mirrored active-active arrays to disk array C. The DR management software is deployed in the remote DR center to manage the active-active disk arrays and asynchronous data replication. The software shows the physical and logical service topologies of the solution and supports one-click DR tests and recovery in the remote DR center.

2.3.2 HyperMetro + Asynchronous Remote Replication

Figure 2-4 HyperMetro + asynchronous geo-redundant 3DC DR architecture

As shown in Figure 2-4, production centers A and B are each deployed with a Huawei OceanStor V3 disk array. The production centers interconnect through a Fibre Channel network using bare fibers or wavelength division multiplexing (WDM) devices, or through a 10GE network. Production centers A and B provide services simultaneously. HyperMetro not only supports real-time bidirectional data mirroring but also ensures that if one disk array fails, the other takes over upper-layer services transparently without interrupting them. Disk array C is deployed in the remote DR center.
An asynchronous remote replication relationship is set up between disk array C and the disk array in either production center A or B to periodically synchronize data from one of the active-active disk arrays to disk array C. The DR management software is deployed in the remote DR center to manage the active-active disk arrays and asynchronous data replication. The software shows the physical and logical service topologies of the solution and supports one-click DR tests and recovery in the remote DR center.

2.4 Technology Implementation Requirements of Key Components

MAN requirements (synchronous remote replication and active-active disk arrays):
DR center distance: < 100 km; recommended distance between disk arrays in active-active mode: < 100 km; connections over bare fibers
Transmission delay: < 1 ms (one-way)
Actual network bandwidth: > write I/O bandwidth at peak hours

WAN requirements (asynchronous remote replication):
DR center distance: unlimited
Transmission delay: < 50 ms (one-way)
Actual network bandwidth: > average write I/O bandwidth

Management workstation:
The management workstation communicates with all three centers.
Distance to the centers: unlimited
Communication network bandwidth: 10 MB/s
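The requirements above can be expressed as a simple pre-deployment check. A sketch using the thresholds stated in this section (function and parameter names are illustrative):

```python
def metro_link_ok(distance_km, one_way_delay_ms, bandwidth_mbps, peak_write_mbps):
    """MAN check for synchronous replication / active-active disk arrays:
    distance < 100 km, one-way delay < 1 ms, and bandwidth above the
    write I/O bandwidth at peak hours."""
    return (distance_km < 100
            and one_way_delay_ms < 1
            and bandwidth_mbps > peak_write_mbps)

def wan_link_ok(one_way_delay_ms, bandwidth_mbps, avg_write_mbps):
    """WAN check for asynchronous replication: one-way delay < 50 ms and
    bandwidth above the average write I/O bandwidth; distance is unlimited."""
    return one_way_delay_ms < 50 and bandwidth_mbps > avg_write_mbps

print(metro_link_ok(80, 0.5, 1000, 600))  # True
print(wan_link_ok(30, 200, 250))          # False: bandwidth below average writes
```

Note the asymmetry in the bandwidth rules: synchronous links must absorb the peak write rate (every write waits for the remote acknowledgement), whereas asynchronous links only need to keep up with the average write rate over a replication cycle.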
3 Solution Working Principles

About This Chapter
3.1 Working Principle of the Cascaded Network in Synchronous + Asynchronous Mode
3.2 Working Principle of the Parallel Network in Synchronous + Asynchronous Mode
3.3 Working Principle of the Cascaded Network in Asynchronous + Asynchronous Mode
3.4 Working Principle of the Parallel Network in Asynchronous + Asynchronous Mode
3.5 Working Principle of the Network in VIS Active-Active + Asynchronous Mode
3.6 Working Principle of the Network in HyperMetro + Asynchronous Mode
3.7 Key Technical Principles of the Disaster Recovery Data Center Solution (Geo-Redundant Mode)
3.8 DR Management

3.1 Working Principle of the Cascaded Network in Synchronous + Asynchronous Mode

3.1.1 Initial Synchronization

When the synchronous remote replication relationship is set up, the system automatically starts initial synchronization to copy all data from the primary logical unit number (LUN) to the secondary LUN. If the primary LUN receives data from a production host during the synchronization, that data is also copied to the secondary LUN. After the initial synchronization is complete, the primary and secondary LUNs contain identical data and synchronous remote replication enters the normal status. When the asynchronous remote replication relationship is set up, the system likewise automatically starts initial synchronization to copy all data from the primary LUN to the secondary LUN. After the initial synchronization is complete, asynchronous remote replication enters the normal status.
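The initial synchronization described above can be sketched as a full copy that also applies concurrent production writes to both sides (a simplification: real arrays track changed blocks rather than copying whole address maps):

```python
def initial_sync(primary: dict, secondary: dict, writes_during_sync: list) -> str:
    """Copy all primary-LUN data to the secondary LUN. Host writes that
    arrive during synchronization are applied to both copies, so the LUNs
    are identical when the copy finishes and replication enters normal
    status. LUNs are modeled as {address: data} dictionaries."""
    secondary.clear()
    secondary.update(primary)          # full initial copy
    for addr, data in writes_during_sync:
        primary[addr] = data           # host write lands on the primary...
        secondary[addr] = data         # ...and is forwarded to the secondary
    return "normal" if secondary == primary else "synchronizing"

lun_primary = {0: "a", 1: "b"}
lun_secondary = {}
print(initial_sync(lun_primary, lun_secondary, [(1, "b2"), (2, "c")]))  # normal
```

After this point the pair only needs to transfer incremental changes, which is why the (slow, full-copy) initial phase is distinguished from the normal status in the text.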
3.1.2 I/O Handling Process

Figure 3-1 I/O handling process for the cascaded network in synchronous + asynchronous mode. [Figure: site A (production host, disk array A, LUN 1) dual-writes data over synchronous remote replication to site B (standby host, disk array B, LUN 12, data at time t1); scheduled background asynchronous remote replication then copies the data to site C (standby host, disk array C, LUN 2, data at time t2).]

Figure 3-1 shows the I/O handling process for the cascaded network in synchronous + asynchronous mode. The steps are as follows:
1. The host delivers I/O data to LUN 1 of disk array A.
2. The I/O data is written to LUN 1 at site A and synchronized to LUN 12 at site B. LUN 12 is the secondary LUN for synchronous remote replication and the primary LUN for asynchronous remote replication.
3. When asynchronous remote replication starts, disk array B creates a point-in-time copy of LUN 12 (for example, the data corresponding to time t1).
4. Disk array C creates a point-in-time copy of LUN 2 before synchronization (for example, the data corresponding to time t2). If asynchronous remote replication fails, the system uses this copy to roll back LUN 2 when it is required by services. This ensures the availability of data in disk array C.
5. The LUN 12 data corresponding to t1 is periodically synchronized to LUN 2 in the background.
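The steps above can be sketched as follows. LUNs are modeled as dictionaries, and a failed replication cycle is simulated with a flag to show the rollback at site C; all names and structures are assumptions for illustration only:

```python
# Hedged sketch of the cascaded synchronous + asynchronous flow: a host write
# is dual-written to LUN 1 (site A) and LUN 12 (site B); a periodic cycle then
# snapshots LUN 12 (time t1) and copies that image to LUN 2 (site C), which
# keeps its own pre-sync snapshot (time t2) for rollback.
import copy

lun1, lun12, lun2 = {}, {}, {}

def host_write(addr, data):
    """Steps 1-2: synchronous dual-write; both sites hold identical data."""
    lun1[addr] = data
    lun12[addr] = data

def async_cycle(simulate_failure=False):
    """Steps 3-5: snapshot LUN 12, then synchronize that image to LUN 2."""
    t2_snapshot = copy.deepcopy(lun2)      # rollback point for site C
    t1_snapshot = copy.deepcopy(lun12)     # consistent image at time t1
    if simulate_failure:
        lun2.clear()
        lun2.update(t2_snapshot)           # roll back: C keeps its last good image
        return False
    lun2.clear()
    lun2.update(t1_snapshot)               # scheduled background synchronization
    return True

host_write(0, "x")
host_write(1, "y")
print(lun12 == lun1)                # True: the synchronous copy never lags
async_cycle()
print(lun2 == lun1)                 # True after the cycle completes
host_write(2, "z")
async_cycle(simulate_failure=True)
print(lun2 == lun1)                 # False: rolled back to the t2 image
```

The point of the sketch is the ordering guarantee: site C is always either at the previous consistent image or at the new one, never at a partially copied state.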
When asynchronous remote replication is due to start, if the status of the secondary LUN (LUN 12) disallows synchronous remote replication, asynchronous remote replication is not started. When the secondary LUN enters a state that allows synchronization, point-in-time copies for the accumulated times are created and asynchronous remote replication starts.

3.1.3 Failover

1. The production center fails.
If the production center is affected by a disaster and cannot provide services, no data is lost because the secondary LUN in the same-city DR center stores the same data as the primary LUN. If the same-city DR center has a standby host, the standby host can access the secondary LUN and take over the services. After the standby host accesses the secondary LUN, the addresses of data written to the LUN are recorded so that future remote replication can run incrementally. This reduces the service failback time.
2. The production center and same-city DR center fail.
If both the production center and the same-city DR center fail due to a serious disaster, most data is preserved because the secondary LUN in the remote DR center stores the primary LUN's data as of a recent point in time (one to two replication cycles before the failure). If the remote DR center has a standby host, the standby host can access the secondary LUN and take over the services. After the standby host accesses the secondary LUN, the addresses of data written to the LUN are recorded so that future remote replication can run incrementally. This reduces the service failback time.

3.1.4 Failback

1. Data is not damaged.
After the production center recovers, if disk arrays A and B are not damaged and the primary LUN can restore its data, the data written to LUN 12 or LUN 2 while the primary LUN was faulty can be copied back to the primary LUN incrementally. After this replication, the replication relationship between the primary and secondary LUNs is retained and services are switched back to the production center. The production host then accesses the primary LUN of disk array A, and data is synchronized from the primary LUN to the secondary LUN in real time.
2. Data is damaged.
If disk array A or disk array B is damaged and its data cannot be restored, the damaged disk array must be rebuilt. Data is then replicated from the secondary end back to primary ends A and B in the reverse direction. The original primary/secondary relationship between the disk arrays is then restored and services are switched back to the production center.

3.1.5 Link or DR Center Failure

When the replication links between the production center and a DR center fail, or a DR center fails, remote replication stops automatically. This does not affect the normal operation of the production center. The primary LUN in the production center records data changes during the downtime. After the fault is rectified, the primary LUN automatically synchronizes data to the secondary LUN incrementally.

3.2 Working Principle of the Parallel Network in Synchronous + Asynchronous Mode

3.2.1 Initial Synchronization

When the synchronous remote replication relationship is set up, the system automatically starts initial synchronization to copy all data from the primary LUN to the secondary LUN.
If the primary LUN receives data from a production host during the synchronization, that data is also copied to the secondary LUN. After the initial synchronization is complete, the primary and secondary LUNs contain identical data and synchronous remote replication enters the normal state. When the asynchronous remote replication relationship is set up, the system automatically starts initial synchronization to copy all data from the primary LUN to the secondary LUN. After the initial synchronization, asynchronous remote replication enters the normal state.

3.2.2 I/O Handling Process

Figure 3-2 I/O handling process for the parallel network in synchronous + asynchronous mode. [Figure: site A (production host, disk array A, LUN 1, data at time t1) dual-writes over synchronous remote replication to site B (standby host, disk array B, LUN 12) and, via scheduled background asynchronous remote replication, copies data to site C (standby host, disk array C, LUN 2, data at time t2).]

The steps are as follows:
1. The host delivers I/O data to LUN 1 of disk array A.
2. The host at site A writes the I/O data to LUN 1 at site A and LUN 12 at site B. LUN 1 is the primary LUN for both synchronous remote replication and asynchronous remote replication.
3. When asynchronous remote replication starts, disk array A creates a point-in-time copy of LUN 1 (for example, the data corresponding to time t1).
4. Disk array C creates a point-in-time copy of LUN 2 before synchronization (for example, the data corresponding to time t2).
If asynchronous remote replication fails, the system uses this copy to roll back LUN 2 when it is required by services. This ensures the availability of data in disk array C.
5. The LUN 1 data corresponding to t1 is periodically synchronized to LUN 2 in the background.

3.2.3 Failover

1. The production center fails.
If the production center is affected by a disaster and cannot provide services, no data is lost because the secondary LUN in the same-city DR center stores the same data as the primary LUN. If the same-city DR center has a standby host, the standby host can access the secondary LUN and take over the services. After the standby host accesses the secondary LUN, the addresses of data written to the LUN are recorded so that future remote replication can run incrementally. This reduces the service failback time.
2. The production center and same-city DR center fail.
If both the production center and the same-city DR center fail due to a serious disaster, most data is preserved because the secondary LUN in the remote DR center stores the primary LUN's data as of a recent point in time (within a few replication cycles). If the remote DR center has a standby host, the standby host can access the secondary LUN and take over the services. After the standby host accesses the secondary LUN, the addresses of data written to the LUN are recorded so that future remote replication can run incrementally. This reduces the service failback time.

3.2.4 Failback

1. Data is not damaged.
After the production center recovers, if disk arrays A and B are not damaged and the primary LUN can restore its data, the data written to LUN 1' while the primary LUN was faulty is copied back to the primary LUN incrementally. After this replication, the replication relationship between the primary and secondary LUNs is retained and services are switched back to the production center.
The production host then accesses the primary LUN of disk array A, and data is synchronized from the primary LUN to the secondary LUN in real time.
2. Data is damaged.
If disk array A or disk array B is damaged and its data cannot be restored, the damaged disk array must be rebuilt. Data is then replicated from the secondary end back to disk arrays A and B in the reverse direction. The original primary/secondary relationship between the disk arrays is then restored and services are switched back to the production center.

3.2.5 Link or DR Center Failure

When the replication links between the production center and a DR center fail, or a DR center fails, remote replication stops automatically. This does not affect the normal operation of the production center. The primary LUN in the production center records data changes during the downtime. After the fault is rectified, the primary LUN automatically synchronizes data to the secondary LUN incrementally.

3.3 Working Principle of the Cascaded Network in Asynchronous + Asynchronous Mode

3.3.1 Initial Synchronization

Initial synchronization is implemented between the primary LUN of the production center and the secondary LUN of the same-city DR center, and between the primary LUN of the same-city DR center and the secondary LUN of the remote DR center. Initial synchronization can be implemented online. If the replication bandwidth is sufficient, initial synchronization can start as soon as configuration is complete. Otherwise, it can be implemented in any of the following ways:
1. Temporarily increase the replication bandwidth and complete initial synchronization.
2.
Relocate the devices to the same place and complete initial synchronization.
3. Complete initial synchronization using portable storage media.

During the initial synchronization, the system automatically creates a snapshot and copies all data from the primary LUN to the secondary LUN; data added during the initial synchronization is not synchronized to the secondary LUN.

3.3.2 Processing in Normal Status

[Figure: production center A (production host, disk array A, data at time t1) replicates asynchronously to same-city DR center B (standby host, disk array B, data at time t2), which in turn replicates asynchronously to remote DR center C (standby host, disk array C, data at time t3).]

The steps are as follows:
1. The host delivers I/O data to LUN 1 of disk array A. The data in LUN 2 and LUN 3 are copies of LUN 1 data from different points in time; the LUN 3 data corresponds to an earlier time than the LUN 2 data. LUN 2 is the secondary LUN for asynchronous remote replication between disk arrays A and B, and the primary LUN for asynchronous remote replication between disk arrays B and C. The LUNs at sites B and C are read-only to hosts.
2. When asynchronous remote replication between disk arrays A and B starts, disk array A creates a point-in-time copy of LUN 1 (for example, the data corresponding to time t1).
3. Disk array B creates a point-in-time copy of LUN 2 before synchronization (for example, the data corresponding to time t2). If asynchronous remote replication fails, the system uses this copy to roll back LUN 2 when it is required by services. This ensures the availability of data in disk array B.
When asynchronous remote replication between disk arrays B and C starts, disk array B creates a point-in-time copy of LUN 2 (for example, the data corresponding to time t2).
4. The LUN 1 data corresponding to t1 is periodically synchronized to LUN 2 in the background.
5. Disk array C creates a point-in-time copy of LUN 3 before synchronization (for example, the data corresponding to time t3). If asynchronous remote replication fails, the system uses this copy to roll back LUN 3 when it is required by services.
6. The LUN 2 data corresponding to t2 is periodically synchronized to LUN 3 in the background.

The following figure shows the asynchronous remote replication process. [Figure: a server writes to disk array A, which replicates asynchronously to disk array B, which in turn replicates asynchronously to disk array C.]

The steps are as follows:
1. Write I/O requests are processed for primary LUN 1.
2. Data written to the primary LUN in cycle N is written to the cache.
3. In cycle N+1, that data is copied from the cache to secondary LUN 2 while the new data of cycle N+1 is written to the cache. A new cycle begins after the data replication completes.
4. Step 2 is repeated.
5. Write I/O requests are processed for secondary LUN 2.
6. When cycle N begins, a snapshot is activated for the secondary LUN, covering the data stored in the cache and on the storage media in cycle N-1.
7. In cycle N, data synchronized from the primary LUN is received and written to the cache of the secondary LUN.
8. After the cycle ends, the snapshot of the secondary LUN is disabled.
9. Write I/O requests are processed for secondary LUN 3.
10.
When cycle N-1 begins, a snapshot is activated for the secondary LUN, covering the data stored in the cache and on the storage media in cycle N-2.
11. In cycle N-1, data synchronized from the primary LUN is received and written to the cache of the secondary LUN.
12. After the cycle ends, the snapshot of the secondary LUN is disabled.

If the write I/O bandwidth of the primary LUN increases temporarily, or the bandwidth of the links between the disk arrays decreases temporarily, the replication cycle lengthens and the amount of data written in a cycle can exceed the cache capacity. In that case, remote replication logs record the excess data without stopping periodic synchronization.

Remote replication ensures data consistency on the secondary LUN, that is, it preserves the dependency relationships among write I/Os. For primary-LUN I/O processing during a replication cycle change, write I/Os that depend on each other are placed in the same cycle or in consecutive cycles in order: an earlier write I/O goes into an earlier cycle and a later write I/O into a later cycle. For secondary-LUN I/O processing, when the secondary LUN is accessed after the primary LUN fails, the system checks whether the secondary LUN has fully synchronized the data of the current cycle. If it has not, the system uses a snapshot to roll the secondary LUN back so that its data corresponds to a cycle boundary. This ensures data consistency. Asynchronous replication through the cache can achieve an RPO on the order of 1s to 6s.

3.3.3 Failover

1.
The production center fails.
If the production center is affected by a disaster and cannot provide services, data loss is minimal because the secondary LUN in the same-city DR center stores the primary LUN's data as of a recent time. If the same-city DR center has a standby host, the standby host can access the secondary LUN and take over the services for the fastest recovery. After the standby host accesses the secondary LUN, the addresses of data written to the LUN are recorded so that future remote replication can run incrementally. This reduces the service failback time.
2. The production center and same-city DR center fail.
If both the production center and the same-city DR center fail due to a serious disaster, most data is preserved because the secondary LUN in the remote DR center stores the primary LUN's data as of a recent point in time (within a few replication cycles). If the remote DR center has a standby host, the standby host can access the secondary LUN and take over the services. After the standby host accesses the secondary LUN, the addresses of data written to the LUN are recorded so that future remote replication can run incrementally. This reduces the service failback time.

3.3.4 Failback

1. Data is not damaged.
After the production center recovers, if disk arrays A and B are not damaged and the primary LUN can restore its data, the data written to LUN 1' while the primary LUN was faulty can be copied back to the primary LUN incrementally. After this replication, the replication relationship between the primary and secondary LUNs is retained and services are switched back to the production center. The production host then accesses the primary LUN of disk array A, and data is synchronized from the primary LUN to the secondary LUN in real time.
2. Data is damaged.
If disk array A or disk array B is damaged and its data cannot be restored, the damaged disk array must be rebuilt. Data is then replicated from the secondary end back to disk arrays A and B in the reverse direction. The original primary/secondary relationship between the disk arrays is then restored and services are switched back to the production center.

3.3.5 Link or DR Center Failure

When the replication links between the production center and a DR center fail, or a DR center fails, remote replication stops automatically. This does not affect the normal operation of the production center. The primary LUN in the production center records data changes during the downtime. After the fault is rectified, the primary LUN automatically synchronizes data to the secondary LUN incrementally.

3.4 Working Principle of the Parallel Network in Asynchronous + Asynchronous Mode

3.4.1 Initial Synchronization

Initial synchronization is implemented between the primary LUN of the production center and the secondary LUN of the same-city DR center, and between the primary LUN of the same-city DR center and the secondary LUN of the remote DR center. Initial synchronization can be implemented online. If the replication bandwidth is sufficient, initial synchronization can start as soon as configuration is complete. Otherwise, it can be implemented in any of the following ways:
1. Temporarily increase the replication bandwidth and complete initial synchronization.
2. Relocate the devices to the same place and complete initial synchronization.
3. Complete initial synchronization using portable storage media.
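Which of the options above is appropriate can be judged by estimating how long online initial synchronization would take at the available replication bandwidth. The sketch below is a rough planning aid; the 70% efficiency factor and the 72-hour threshold are hypothetical assumptions, not figures from this solution:

```python
# Rough planning sketch: estimate online initial-synchronization time and
# decide whether an alternative method (bandwidth boost, co-location,
# portable media) is needed. All thresholds are illustrative.

def initial_sync_hours(data_tb, bandwidth_mbps, efficiency=0.7):
    """Transfer time for data_tb terabytes over a bandwidth_mbps link,
    assuming a hypothetical 70% usable fraction for protocol overhead."""
    usable_mbps = bandwidth_mbps * efficiency
    seconds = data_tb * 8 * 1024 * 1024 / usable_mbps   # TB -> Mbit
    return seconds / 3600

def choose_option(data_tb, bandwidth_mbps, max_hours=72):
    if initial_sync_hours(data_tb, bandwidth_mbps) <= max_hours:
        return "online"   # start immediately over the existing link
    return "boost link, co-locate devices, or use portable media"

print(round(initial_sync_hours(10, 1000), 1))   # about 33.3 hours
print(choose_option(10, 1000))                  # online
print(choose_option(100, 100))                  # an offline method is needed
```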
During the initial synchronization, the system automatically creates a snapshot and copies all data from the primary LUN to the secondary LUN; data added during the initial synchronization is not synchronized to the secondary LUN.

3.4.2 Processing in Normal Status

[Figure: production center A (production host, disk array A, data at times t1 and t3) replicates asynchronously both to same-city DR center B (standby host, disk array B, data at time t2) and to remote DR center C (standby host, disk array C, data at time t4).]

The steps are as follows:
1. The host delivers I/O data to LUN 1 of disk array A. The data in LUN 2 and LUN 3 are copies of LUN 1 data from different points in time. Generally, the LUN 3 data corresponds to an earlier time than the LUN 2 data. (If the LUN 2 data corresponds to 10:00 a.m., the LUN 3 data may correspond to 9:00 a.m.) LUN 1 is the primary LUN for asynchronous remote replication between disk arrays A and B, and also the primary LUN for asynchronous remote replication between disk arrays A and C. The LUNs at sites B and C are read-only to hosts.
2. When asynchronous remote replication between disk arrays A and B starts, disk array A creates a point-in-time copy of LUN 1 (for example, the data corresponding to time t1).
3. Disk array B creates a point-in-time copy of LUN 2 before synchronization (for example, the data corresponding to time t2). If asynchronous remote replication fails, the system uses this copy to roll back LUN 2 when it is required by services. This ensures the availability of data in disk array B.
4. The LUN 1 data corresponding to t1 is periodically synchronized to LUN 2 in the background.
5. When asynchronous remote replication between disk arrays A and C starts, disk array A creates a point-in-time copy of LUN 1 (for example, the data corresponding to time t3).
6. Disk array C creates a point-in-time copy of LUN 3 before synchronization (for example, the data corresponding to time t4). If asynchronous remote replication fails, the system uses this copy to roll back LUN 3 when it is required by services.
7. The LUN 1 data corresponding to t3 is periodically synchronized to LUN 3 in the background.

The following figure shows the asynchronous remote replication process. [Figure: a server writes to disk array A, which replicates asynchronously to disk array B and to disk array C.]

1. Write I/O requests are processed for primary LUN 1.
2. Data written to the primary LUN in cycle N is written to the cache.
3. In cycle N+1, that data is copied from the cache to secondary LUN 2 while the new data of cycle N+1 is written to the cache. A new cycle begins after the data replication completes.
4. Step 2 is repeated.
5. Write I/O requests are processed for secondary LUN 2.
6. When cycle N begins, a snapshot is activated for the secondary LUN, covering the data stored in the cache and on the storage media in cycle N-1.
7. In cycle N, data synchronized from the primary LUN is received and written to the cache of the secondary LUN.
8. After the cycle ends, the snapshot of the secondary LUN is disabled.
9. Write I/O requests are processed for secondary LUN 3.
10. When cycle N-1 begins, a snapshot is activated for the secondary LUN, covering the data stored in the cache and on the storage media in cycle N-2.
11. In cycle N-1, data synchronized from the primary LUN is received and written to the cache of the secondary LUN.
12. After the cycle ends, the snapshot of the secondary LUN is disabled.

If the write I/O bandwidth of the primary LUN increases temporarily, or the bandwidth of the links between the disk arrays decreases temporarily, the replication cycle lengthens and the amount of data written in a cycle can exceed the cache capacity. In that case, remote replication logs record the excess data without stopping periodic synchronization.

Remote replication ensures data consistency on the secondary LUN, that is, it preserves the dependency relationships among write I/Os. For primary-LUN I/O processing during a replication cycle change, write I/Os that depend on each other are placed in the same cycle or in consecutive cycles in order: an earlier write I/O goes into an earlier cycle and a later write I/O into a later cycle. For secondary-LUN I/O processing, when the secondary LUN is accessed after the primary LUN fails, the system checks whether the secondary LUN has fully synchronized the data of the current cycle. If it has not, the system uses a snapshot to roll the secondary LUN back so that its data corresponds to a cycle boundary. This ensures data consistency. Asynchronous replication through the cache can achieve an RPO on the order of 1s to 6s.

3.4.3 Failover

1. The production center fails.
If the production center is affected by a disaster and cannot provide services, data loss is minimal because the secondary LUN in the same-city DR center stores the primary LUN's data as of a recent time. An RPO on the order of 0s to 6s can be achieved. If the same-city DR center has a standby host, the standby host can access the secondary LUN and take over the services for the fastest recovery. After the standby host accesses the secondary LUN, the addresses of data written to the LUN are recorded so that future remote replication can run incrementally. This reduces the service failback time.
2. The production center and same-city DR center fail.
If both the production center and the same-city DR center fail due to a serious disaster, most data is preserved because the secondary LUN in the remote DR center stores the primary LUN's data as of a recent point in time (within a few replication cycles). If the remote DR center has a standby host, the standby host can access the secondary LUN and take over the services. After the standby host accesses the secondary LUN, the addresses of data written to the LUN are recorded so that future remote replication can run incrementally. This reduces the service failback time.

3.4.4 Failback

1. Data is not damaged.
After the production center recovers, if disk arrays A and B are not damaged and the primary LUN can restore its data, the data written to LUN 1' while the primary LUN was faulty can be copied back to the primary LUN incrementally. After this replication, the replication relationship between the primary and secondary LUNs is retained.
Services are then switched back to the production center. The production host accesses the primary LUN of disk array A, and data is synchronized from the primary LUN to the secondary LUN in real time.
2. Data is damaged.
If disk array A or disk array B is damaged and its data cannot be restored, the damaged disk array must be rebuilt. Data is then replicated from the secondary end back to disk arrays A and B in the reverse direction. The original primary/secondary relationship between the disk arrays is then restored and services are switched back to the production center.

3.4.5 Link or DR Center Failure

When the replication links between the production center and a DR center fail, or a DR center fails, remote replication stops automatically. This does not affect the normal operation of the production center. The primary LUN in the production center records data changes during the downtime. After the fault is rectified, the primary LUN automatically synchronizes data to the secondary LUN incrementally.

3.5 Working Principle of the Network in VIS Active-Active + Asynchronous Mode

3.5.1 Initial Synchronization

Initial synchronization for the network in active-active + asynchronous mode includes initial synchronization between the active-active data centers and initial synchronization from the primary LUN of the active-active data centers to the secondary LUN of the remote DR center. If the replication bandwidth is sufficient, initial synchronization can start as soon as configuration is complete. Otherwise, it can be implemented in any of the following ways:
1. Temporarily increase the replication bandwidth and complete initial synchronization.
2. Relocate the devices to the same place and complete initial synchronization.
3. Complete initial synchronization using portable storage media.
During the initial synchronization, the system automatically creates a snapshot and copies all data from the primary LUN to the secondary LUN; data added during the initial synchronization is not synchronized to the secondary LUN.

3.5.2 Processing in Normal Status

[Figure: a host writes to a VIS cluster mirrored volume built from the mirrored data disks and differential bitmap disks of the disk arrays in data centers A and B; a point-in-time snapshot in the resource pool is replicated to the disk array of remote DR center C.]

The process of handling write I/O requests for a VIS mirrored volume is as follows:
1. A write I/O request is delivered to a mirrored volume.
2. The mirrored volume duplicates the request and delivers the copies to the mirrored data disks of the two data centers.
3. The mirrored data disks return responses indicating write I/O completion.
4. The mirrored volume returns a response indicating write I/O completion.
5. Remote replication begins at the specified time T by creating a snapshot.
6. The disk array of the remote DR center automatically creates a timestamped snapshot, which is used for rollback if a synchronization fails.
7. Data is copied to the remote DR center incrementally.

When the disk array of a data center fails, or an entire data center fails, the mirrored volume uses the disk array of the surviving data center to respond to host I/O requests and uses the differential bitmap disks to record data changes during the downtime. After the fault is rectified, data is synchronized incrementally. This reduces the amount of data to be synchronized, the synchronization time, and the bandwidth required for synchronization.
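The mirrored-volume write path and differential-bitmap tracking described above can be sketched as follows. The class and attribute names are assumptions for illustration, not the VIS interface:

```python
# Illustrative sketch: writes are duplicated to both arrays; while one array
# is down, its changed block addresses are tracked in a differential bitmap so
# that only those blocks are resynchronized when it returns.

class MirroredVolume:
    def __init__(self):
        self.array_a, self.array_b = {}, {}
        self.b_online = True
        self.dirty = set()                 # differential bitmap for array B

    def write(self, addr, data):
        self.array_a[addr] = data          # mirror leg A
        if self.b_online:
            self.array_b[addr] = data      # mirror leg B (dual-write)
        else:
            self.dirty.add(addr)           # record the change for later

    def resync_b(self):
        """Incremental resync: copy only blocks changed during the downtime."""
        for addr in self.dirty:
            self.array_b[addr] = self.array_a[addr]
        self.dirty.clear()
        self.b_online = True

vol = MirroredVolume()
vol.write(0, "x")
vol.b_online = False                       # simulate a failure of array B
vol.write(1, "y")
vol.write(0, "x2")
vol.resync_b()                             # only blocks 0 and 1 are copied
print(vol.array_b == vol.array_a)          # True
```

The bitmap is why recovery needs far less bandwidth than a full copy: the amount of data moved is proportional to the changes made during the outage, not to the volume size.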
When a disk array involved in data replication fails, if the fault can be rectified, the disk array automatically replicates data in an incremental manner after the fault is rectified. If the fault cannot be rectified, initial data synchronization must be implemented again. The active-active + disk array data replication mode can achieve an RPO and RTO as low as 0s and enables the same-city DR center to take over services automatically. One-click recovery enables the remote DR center to achieve minute-level service recovery.

3.5.3 Failover

In active-active + asynchronous mode, different failover modes respond to different failures: failure of production center A, failure of production center B, and failure of both production centers.
1. Production center A fails. When production center A fails, production center B automatically takes over services. For troubleshooting details, see the description of the active-active + asynchronous mode.
2. Production center B fails. Asynchronous replication is implemented between production center B and the remote DR center, so production service takeover is not affected when production center B fails. However, because the replication is asynchronous, the current data of production center B cannot be synchronized to the remote DR center after production center B fails. If the fault in production center B can be rectified, the active-active production centers automatically synchronize the changed data to the disk array of production center B after the fault is rectified and synchronize the incremental data to the remote DR center. If the fault in production center B cannot be rectified, the active-active production centers implement initial mirror data synchronization again and synchronize the initial data to the disk array of the remote DR center.
3. Production centers A and B fail. If production centers A and B fail due to a serious disaster, most data is not lost because the secondary LUN of the remote DR center stores the data of the primary LUN as of a recent historical point (within the replication cycles). If the remote DR center has a standby host, the standby host can access the secondary LUN to take over the services. After the standby host accesses the secondary LUN, the addresses of data written to the LUN are recorded for future incremental remote data replication. This reduces the service failback time.

3.5.4 Failback

1. Data is not damaged. After the production center is recovered, if disk arrays A and B are not damaged and the primary LUN can restore its data, the data written to LUN 1' while the primary LUN was faulty can be copied to the primary LUN in an incremental manner. After data replication, the replication relationship between the primary and secondary LUNs is retained. Then, services are switched back to the production center. The production host accesses the primary LUN of disk array A, and data is synchronized from the primary LUN to the secondary LUN in real time.
2. Data is damaged. If disk array A or disk array B is damaged and its data cannot be restored, the damaged disk array must be rebuilt. Replicate data from the secondary end back to disk arrays A and B. Then, adjust the original primary/secondary relationship between the disk arrays and switch the services back to the production center.
3.5.5 Link or DR Center Failure

When the replication links between the production center and the DR center fail, or the DR center fails, remote replication stops automatically. This does not affect the normal operation of the production center. The primary LUN of the production center records data changes during the downtime. After the fault is rectified, the primary LUN automatically synchronizes data to the secondary LUN in an incremental manner.

3.6 Working Principle of the Network in HyperMetro + Asynchronous Mode

HyperMetro supports 3DC networking in cascaded asynchronous and parallel asynchronous modes. The two modes are similar in terms of technical principle. For details, see the following working principle of a cascaded network in HyperMetro + asynchronous mode.

3.6.1 Initial Synchronization

Initial synchronization for the network in HyperMetro + asynchronous replication mode includes initial synchronization between the active-active data centers and initial synchronization from the primary LUN of the active-active data centers to the secondary LUN of the remote DR center. It is recommended that a Fibre Channel network be used between the HyperMetro active-active sites; with such a network, initial synchronization can be completed directly after configuration. Based on the available network bandwidth, initial synchronization to the remote DR center can be performed in any of the following ways:
1. Temporarily increase the replication bandwidth and complete initial synchronization.
2. Relocate the devices to the same place and complete initial synchronization.
3. Complete initial synchronization using portable storage media.
During the initial synchronization, the system automatically creates a snapshot to copy all data from the primary LUN to the secondary LUN; data added during the initial synchronization is not synchronized to the secondary LUN.

3.6.2 Processing in Normal Status

The process of handling write I/O requests for HyperMetro + asynchronous remote replication is as follows:
1. A write I/O request is delivered to an active-active LUN.
2. The active-active LUN delivers the request to the active-active data LUNs in the two data centers.
3. The active-active data LUNs return a message indicating that the write I/O is complete.
4. Remote asynchronous replication is triggered periodically. The disk array at the primary site automatically creates a timestamp snapshot and notifies the DR center to create a timestamp snapshot too.
5. The disk array of the remote DR center creates a timestamp snapshot, which is used for taking over services in the DR center if a failure occurs during the replication.
6. The incremental data is copied to the remote DR center.
7. After the incremental data is copied to the remote DR center, the data in the secondary LUN of the remote DR center is complete and the replication relationship is normal.

The active-active + disk array data replication mode enables the same-city DR center to achieve an RPO and RTO of 0. The multi-timestamp cache technology enables the remote DR center to achieve a second-level RPO. One-click recovery enables the remote DR center to achieve a minute-level RTO.
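The periodic asynchronous replication cycle (steps 4-7 above) can be sketched as follows. This is a simplified model under the assumption that snapshots are plain dictionary copies; the class and method names are invented for illustration.

```python
# Sketch of one asynchronous replication cycle: snapshot the primary,
# snapshot the secondary for rollback, ship only the changed blocks,
# and remember the new consistency point.
import copy

class AsyncReplicationPair:
    def __init__(self):
        self.primary = {}            # active-active LUN contents
        self.secondary = {}          # secondary LUN at the remote DR center
        self.last_snapshot = {}      # timestamp snapshot of the last cycle

    def write(self, block, data):
        self.primary[block] = data   # dual-write to both AA LUNs omitted

    def replicate_cycle(self):
        # Primary site creates a timestamp snapshot (step 4).
        snap = copy.deepcopy(self.primary)
        # DR site snapshots the secondary LUN so it can roll back to a
        # consistency point if this cycle fails midway (step 5).
        rollback = copy.deepcopy(self.secondary)
        try:
            # Copy only the blocks that changed since the last cycle (step 6).
            delta = {b: d for b, d in snap.items()
                     if self.last_snapshot.get(b) != d}
            self.secondary.update(delta)
        except Exception:
            self.secondary = rollback     # restore the consistency point
            raise
        # Cycle complete (step 7): remember what the secondary now holds.
        self.last_snapshot = snap
        return delta
```

Note how the second cycle ships only the blocks changed since the previous snapshot, which is what keeps the replication incremental.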
3.6.3 Failover

In HyperMetro + asynchronous mode, different failover modes respond to different failures: failure of production center A, failure of production center B, and failure of both production centers.
1. Production center A fails. When production center A fails, production center B automatically takes over its services and records the data differences between the two production centers. Asynchronous replication is not affected. If the storage device in production center A is rectifiable and the active-active data LUNs and the active-active configuration are normal, production center B replicates the differential data generated during the failure to production center A until the active-active working status becomes normal.
2. Production center B fails. Asynchronous replication is implemented between production center B and the remote DR center, so production service takeover is not affected when production center B fails. However, because the replication is asynchronous, the current data of production center B cannot be synchronized to the remote DR center after production center B fails. If the fault in production center B can be rectified and the active-active data LUNs and the active-active relationship are in the normal state, the active-active production centers automatically synchronize the differential data to the disk array of production center B after the fault is rectified and synchronize the incremental data to the remote DR center.
If the fault in production center B cannot be rectified, the active-active production centers implement initial mirror data synchronization again and synchronize the initial data to the remote DR center.
3. Production centers A and B fail. If production centers A and B are located near each other, both of them may fail due to a disaster. In this case, the remote DR center takes over the services. When the DR center takes over the services, data must be rolled back to the latest consistency point. In this process, data generated in a maximum of two replication cycles may be lost. After the secondary LUN of the remote DR center takes over services, the remote replication records the differential data for later incremental restoration, shortening the switchback duration.

3.6.4 Failback

1. Failback when production center A fails. If the storage device in production center A is rectifiable and the active-active data LUNs and the active-active configuration are normal, production center B replicates the differential data generated during the failure to production center A until the active-active working status becomes normal. If the fault in production center A cannot be rectified, active-active configuration must be implemented again between production centers A and B to complete the initial data synchronization.
2. Failback when production center B fails. If the fault in production center B can be rectified, the active-active production centers automatically synchronize the changed data to the disk array of production center B after the fault is rectified and synchronize the incremental data to the remote DR center.
If the fault in production center B cannot be rectified, implement active-active configuration again between production centers A and B and asynchronous replication configuration between production center B and the DR center, and complete the initial data synchronization. This recovers the active-active relationship between production centers A and B and the asynchronous replication relationship between production center B and the DR center. After both the active-active mode and the asynchronous replication have returned to the normal state, complete the failback operation.
3. Production centers A and B fail. If the faults in production centers A and B are rectifiable and the active-active data LUNs and the active-active configuration are normal, confirm whether data needs to be synchronized from the DR center to the production centers. If yes, replicate data from the DR center to production center B, and then synchronize data from production center B to production center A to recover services. If data in the DR center does not need to be replicated to production center B, directly recover services in production centers A and B; the incremental data in the DR center will be overwritten. If the data in production centers A and B is completely damaged, synchronize data from the DR center to production center B, implement active-active configuration again between production centers A and B, and complete the initial data synchronization. Then recover the asynchronous replication relationship between production center B and the DR center. After both the active-active mode and the asynchronous replication have returned to the normal state, complete the failback operation.
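The failback logic for the scenario where both production centers fail can be summarized as a small decision function. This is a reading aid only; the inputs and step labels are invented, not part of any Huawei interface.

```python
# Decision sketch for failback after both production centers fail,
# following the three cases described above.

def plan_failback(a_rectifiable, b_rectifiable, aa_config_ok, need_dr_data):
    """Return an ordered list of recovery steps for the A+B failure case."""
    if a_rectifiable and b_rectifiable and aa_config_ok:
        if need_dr_data:
            # DR data is newer: pull it back through B, then to A.
            return ["replicate DR -> B", "sync B -> A", "recover services"]
        # Local data is acceptable: restart locally, DR increments overwritten.
        return ["recover services in A and B", "overwrite DR increments"]
    # Data totally damaged: rebuild from the DR center.
    return ["sync DR -> B", "reconfigure active-active A<->B",
            "initial sync", "recover replication B -> DR", "failback"]
```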
3.6.5 Link or DR Center Failure

HyperMetro allows you to specify a preferred site. When a network fault occurs, the preferred site has priority to take over services. In HyperMetro + asynchronous replication mode, it is recommended that production center B be configured as the preferred site. When the network fails, production center B takes over services; production center B and the DR center can work properly to ensure that the RPO is achieved. When the links between production centers A and B are faulty, HyperMetro arbitrates services to data center B preferentially, and the replication between data center B and the DR center is not affected. After services are switched to data center B, it records the differential data relative to data center A. After the network recovers, data center B synchronizes the differential data to data center A, and the active-active mode returns to the normal state. When the replication links between production center B and the DR center are faulty, or the devices in the DR center are faulty, the remote replication disconnects automatically without affecting the normal running of the production system. After the remote replication disconnects, production center B records the differential data generated during the failure. After the fault is rectified, it synchronizes the differential data to the DR center.

3.7 Key Technical Principles of the Disaster Recovery Data Center Solution (Geo-Redundant Mode)

Access of Active-Active Disk Arrays

In active-active + asynchronous replication mode, the key technologies of the same-city active-active production centers involve the multi-data-center storage cluster, uninterrupted service access, and optimized geographical access.
The VIS cluster technology is used to set up the active-active storage architecture, which includes a VIS cluster with four nodes. Each node provides unbiased parallel data access for application servers through shared volumes and processes I/O requests from the application servers. The nodes back up each other and implement load balancing. When any node fails, the services it provides are switched to a normal node to ensure system reliability and business continuity. For a detailed description, see the Huawei Business Continuity and Disaster Recovery Solution V100R002C10 Disaster Recovery Data Center Solution (Active-Active Mode) Technical White Paper.

Remote Replication State Shift

Remote replication involves the Synchronizing, Split, Normal, Interrupted, and Invalid states. The following table describes these states.

Table 3-1 Remote replication states
Normal: Remote replication enters this state when the primary and secondary LUNs have the same data after initial synchronization, or when synchronization between the primary and secondary LUNs is complete.
Split: Remote replication enters this state when the primary and secondary LUNs contain different data after initial synchronization. Remote replication also enters this state after splitting is performed while synchronization is in progress or while remote replication is in the Normal or Interrupted state.
Synchronizing: Remote replication enters this state after synchronization is performed when remote replication is in the Split or Interrupted state.
Interrupted: Remote replication enters this state after an I/O failure, LUN failure, or replication link failure occurs when remote replication is in the Normal or Synchronizing state.
Invalid: Remote replication enters this state when the basic pair properties of the primary disk array differ from those of the secondary disk array.

The following figure shows the remote replication state shift.

Figure 3-3 Remote replication state shift (diagram of the transitions among the Normal, Split, Synchronizing, Interrupted, and Invalid states, triggered by events such as pair creation, synchronization, splitting, link disconnection and recovery, I/O errors, and configuration damage)

Cache-based Multi-Timestamp Replication

HyperReplication/A uses the cache-based multi-timestamp snapshot technology. When the primary end requires copy-on-write (COW), a host can complete writing I/Os to the cache without waiting for the COW to complete. This greatly reduces the adverse impact of COW, and therefore of remote data replication, on host performance.
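As a concrete illustration of writes completing in cache while replication proceeds separately, a toy model of such a multi-timestamp cache might look like the following. The class and its behavior are invented for illustration and do not reflect HyperReplication/A internals.

```python
# Host writes land in the cache slice for the current timestamp and are
# acknowledged immediately; the replicator later ships closed slices,
# oldest first, directly from the cache to the DR end.

class TimestampedCache:
    def __init__(self):
        self.t = 0
        self.slices = {0: {}}        # timestamp -> {block: data}
        self.dr_end = {}

    def host_write(self, block, data):
        self.slices[self.t][block] = data
        return "ack"                 # host never waits for COW or replication

    def freeze(self):
        """Close the current slice and open a new one (new timestamp)."""
        self.t += 1
        self.slices[self.t] = {}

    def replicate(self):
        """Copy all closed slices to the DR end, oldest first."""
        for ts in sorted(s for s in self.slices if s < self.t):
            self.dr_end.update(self.slices.pop(ts))
```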
During remote data replication, the primary end copies data directly from the cache to reduce the delay and achieve a second-level remote replication RPO.

Figure 3-4 Principle of cache-based multi-timestamp replication (a host write is acknowledged as soon as the data is written to the cache slice with the current timestamp, for example T+3; snapshots regularly flush data to disks according to the COW mechanism; based on the synchronization interval, the data in the slices corresponding to one or more cache timestamps is copied to the DR end)

Because data is copied directly from the cache and snapshots do not require real-time data updates under the COW mechanism, the synchronization has only a slight adverse impact on performance and a second-level RPO can be achieved.

Block I/O Technology

A consistency group for remote replication must suspend host I/O operations in specific scenarios to block I/O delivery by a host and ensure data consistency among the members of the group. Using the Block I/O technology, OceanStor achieves microsecond-level host I/O suspension in multi-controller mode, whereas most devices in the industry achieve only second-level host I/O suspension. The Block I/O technology therefore reduces the adverse impact of remote replication on host I/O performance and increases control efficiency.

Multi-Site Bad Block Repair Technology

Host services may be interrupted when a disk array has bad tracks that cannot be repaired by the RAID rebuilding technology, or when DIF verification fails while the host reads or writes the disk array. The Disaster Recovery Data Center Solution (Geo-Redundant Mode) provides an enhanced bad block repair technology.
If data has been copied to the LUN of the same-city DR center and the production LUN has a bad block that cannot be fixed, or data integrity field (DIF) verification fails, the system can redirect the host's read requests to the LUN of the same-city DR center to read the correct data and repair the production LUN. This greatly improves the overall reliability of the solution.

Reverse Incremental Synchronization Technology

OceanStor supports reverse incremental synchronization. After the secondary LUN of a DR center is configured to be writable, the LUN can be mapped to the standby production host to recover the production services. Data differences between the primary and secondary LUNs are recorded. After a switchover between the primary and secondary LUNs, the data differences are combined for reverse incremental synchronization. This enables quick service failback after disaster recovery and saves the time and resources required for full data synchronization. In geo-redundant mode, when the production center fails, or both the production center and the same-city DR center fail, the reverse incremental synchronization technology can be used to recover services for the same-city DR center and the remote DR center. This greatly reduces the service failback time after disaster recovery and the impact of service failback.

3.8 DR Management

The DR management software controls the entire DR system, manages system resources including servers, storage devices, and software, and manages services throughout the DR process, covering DR migration, disaster recovery, DR inspection, DR analysis, and DR reports. It greatly simplifies DR system management and reduces the DR system maintenance cost.
Dashboard

The Dashboard enables you to understand the status of the DR system. The main page displays task execution results, task execution times, protection settings for applications such as Oracle and SQL Server, statistics, and system operation information. The Dashboard also clearly displays information about critical alarms of the DR system so that you can identify and rectify faults in a timely manner.

Figure 3-5 Dashboard for DR management

DR Configuration Wizard

The configuration wizard greatly reduces the technical difficulty for DR management personnel. The DR management system provides the Quick Start and Wizard modes for configuring the hardware and software resources, DR sites, and application systems of the DR system. The clear configuration steps enable quick DR service management.

Figure 3-6 DR management and configuration wizard

Smart Association for DR Protection

Smart association for DR protection simplifies configuration and inspection of the DR system. It enables the DR management system to automatically identify hosts, applications, the storage devices used by the applications, and the replication relationships between storage devices. With smart association for DR protection, management personnel who are familiar with the applications on the hosts can configure and manage DR for the application system in an end-to-end manner and generate DR topologies and DR details.

Figure 3-7 DR replication topology

Automatic DR Topology Generation

The global DR topology enables you to understand the overall system status. It clearly shows the point-to-point, active-active, and geo-redundant modes, the DR relationships across the entire network, and the operation structure.
In this way, you can obtain comprehensive information including the status of the servers at the production end, the storage device status, the replication status, and the status of the DR site devices.

Figure 3-8 DR management topology

One-click Disaster Recovery

One-click disaster recovery enables you to address disasters easily. In the DR management system, you can test DR data availability, implement scheduled migration, and test DR system availability with one click. You can also rectify a fault that occurs on the DR end with one click. The processes, detailed steps, task execution results, and task execution status of disaster recovery are visible to you.

Figure 3-9 One-click disaster recovery

4 Service Recovery Process of the Disaster Recovery Data Center Solution (Geo-Redundant Mode)

About This Chapter
4.1 DR Test Process
4.2 Scheduled Migration Process
4.3 Failover Process

4.1 DR Test Process

A DR test checks whether the same-city DR center or remote DR center can recover services when a disaster occurs and checks the disaster recovery result. The DR management system, with a GUI, provides a one-click DR test function. You can select a scheduled DR test task to be executed and click the Test button shown in the following figure; the system then automatically performs the DR test and generates a test result.
Figure 4-1 One-click DR test

A DR test consists of two steps: test and clearance. During a DR test, a snapshot of a DR center is used to recover the service system, so a DR test and the subsequent environment clearance do not affect the production system or the DR service. The DR test process is as follows:
1. Create a snapshot of the target LUN of the same-city DR center or remote DR center for remote replication.
2. Map the snapshot to the standby host of the DR center.
3. Start services on the standby host of the DR center.
4. Test the data availability and consistency of the same-city DR center on the standby host.

The environment clearance process is as follows:
1. Stop the host test services of the same-city DR center.
2. Delete the mapping of the snapshot to the standby host of the same-city DR center.
3. Delete the snapshot.

4.2 Scheduled Migration Process

During scheduled migration, the scenario where the production center fails is simulated and the production services are recovered in the same-city DR center to check migration feasibility and DR data availability. The DR management system, with a GUI, provides a one-click scheduled migration function. After applications are stopped in the production system, you can click the Execute button shown in the following figure to perform the migration.

Figure 4-2 One-click scheduled migration

The scheduled migration process is as follows:
1. Stop the host services of the production center.
2. Delete the remote replication mapping from the primary LUN of the production center to the host of the production center.
3. Configure LUN B of the same-city DR center to be readable and writable.
4. Map LUN B to the standby host of the same-city DR center.
5. Start the services on the standby host of the same-city DR center.
6. Test the data availability and consistency of the same-city DR center on the standby host.

4.3 Failover Process

Disasters such as fires and floods usually cause production center failures. The DR management system, with a GUI, provides a one-click failover function. When the production center is affected by a disaster, you can click the Execute button shown in the following figure to perform a failover.

Figure 4-3 One-click failover

The failover process is applicable in the following scenarios: the production center fails and services are recovered in the same-city DR center; both the production center and the same-city DR center fail and services are recovered in the remote DR center. Taking the second scenario as an example, the failover process is as follows:
1. The power supply of the production center fails and services are interrupted.
2. Configure the LUN of the remote DR center to be readable and writable.
3. Map the LUN of the remote DR center to the standby host of the remote DR center.
4. Start the services on the standby host of the remote DR center.
5. Test the data availability and consistency on the standby host of the remote DR center.
6. The failover process ends.
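The failover steps above can be sketched as a simple orchestration routine. The data structures and step labels are invented for illustration; a real DR manager drives storage and host agents rather than plain dictionaries.

```python
# Orchestration sketch of one-click failover: make the DR LUN writable,
# map it to the standby host, start services, then verify availability.

def one_click_failover(dr_lun, standby_host):
    steps = []
    dr_lun["writable"] = True                    # step 2: LUN read/write
    steps.append("set LUN read/write")
    standby_host["mapped_lun"] = dr_lun          # step 3: map to standby host
    steps.append("map LUN to standby host")
    standby_host["services_running"] = True      # step 4: start services
    steps.append("start services")
    ok = standby_host["mapped_lun"]["writable"]  # step 5: basic availability check
    steps.append("verify data availability: %s" % ok)
    return steps
```

Running the routine against a failed-over environment yields an auditable list of executed steps, which mirrors how the DR management system makes each task's status visible.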
5 Summary

This document describes the architecture, implementation principles, and disaster recovery process of the Disaster Recovery Data Center Solution (Geo-Redundant Mode). All Huawei storage products use the unified storage operating system platform, so remote replication relationships can be set up among high-end, mid-range, and entry-level disk arrays. Customers can select disk arrays for their remote DR centers based on their business requirements, which enables them to set up highly cost-effective DR systems. OceanStor ReplicationDirector uses graphics to show the physical topology and service logical topology of the Disaster Recovery Data Center Solution (Geo-Redundant Mode). It supports one-click DR test and failover functions and allows customers to use customized scripts to recover DR service systems, simplifying DR system management and maintenance.

6 Acronyms and Abbreviations

RPO: Recovery Point Objective
RTO: Recovery Time Objective
IP: Internet Protocol
iSCSI: Internet Small Computer Systems Interface
LUN: Logical Unit Number