Download AlwaysOn Design Patterns and Lessons Learned.ppt

SQL Server 2012: AlwaysOn HA and DR Design Patterns, and Lessons Learned from Early Customer Deployments Sanjay Mishra SQLCAT Who and Why • Who am I? • Why I am speaking about this? Setting the Stage • • • AlwaysOn ≠ Availability Groups AlwaysOn = { SQL Server Failover Cluster Instances, Availability Groups } Availability Groups ≠ Database Mirroring Key Learnings from Early Customer Deployments • • • • • • • • • • • • • Windows Cluster (Windows Server Failover Cluster, WSFC) is the foundation for HA and DR in SQL Server 2012 AlwaysOn AlwaysOn inherits all “characteristics” of Windows Cluster Windows Cluster every single AlwaysOn deployment is a Windows Cluster deployment Windows Cluster must understand Windows Cluster if you want to deploy, operate, monitor, troubleshoot, administer AlwaysOn key areas are: quorum model, cluster network communication, DR procedures, cluster.exe, PowerShell Windows Cluster ≠ SQL Cluster (SQL Server Failover Cluster Instance) therefore, is NOT necessarily a shared-storage cluster Windows Cluster many key enhancements have been made to Windows Cluster specifically for SQL Server 2012 AlwaysOn • Asymmetric Disk • Node Votes • Asymmetric Disk as Quorum resource Key Learnings from Early Customer Deployments • • • • • • • • • • • • • Organizational structure Typically, teams and skills are organized into separate groups – SQL Server DBA team and Windows Server Admin team AlwaysOn reaches out beyond the SQL Server DBA DBAs need to work closely with Windows / Network Administration teams • Not just for initial deployment, but for troubleshooting and disaster recovery as well Historical experience need to unlearn and relearn a few things if you are already experienced with Windows Cluster, but new to AlwaysOn For example, if you haven’t read the Windows Cluster documentation in the last few months, it is worth a re-read now New/Different Tools for administration and troubleshooting Windows cluster log Failover Cluster Manager Powershell cluster.exe Knowledge of PowerShell and cluster.exe command lines will come very handy SQL Server 2012 AlwaysOn HA+DR Design Patterns SQL Server 2012 AlwaysOn Solution Characteristics HA+DR Solution Multi-site Failover Cluster 1 Instance (FCI) for HA and DR 2 Availability Group for HA and DR Failover Cluster Instance for 3 local HA + Availability Group for DR Corresponding Pre-Denali Solution • • • • • Shared Storage solution * Instance Level HA (automatic) Instance Level DR (automatic *) Uses storage replication Doesn’t require database to be in FULL recovery model Multi-site FCI using stretch VLAN • • • • • Non-Shared Storage solution (Group of) Database Level HA (automatic) (Group of) Database Level DR (manual) DR replica can be Active Secondary Requires database to be in FULL recovery model Database Mirroring for Local HA and Log Shipping for DR • Combined Shared Storage and Non-Shared Storage Instance Level HA (automatic) (Group of) Database Level DR (manual) DR replica can be Active Secondary Requires database to be in FULL recovery model Asymmetric storage is the key to this solution Failover Cluster Instance for Local HA and Database Mirroring for DR • • • • • SQL Server 2012: Multi-site FCI for HA and DR Primary Site Node 1 Active DR Site Windows Server Failover Cluster Node 3 Node 2 Passive SQL-FCI Storage Replication Passive Node 4 Passive SQL Server 2012: AG for HA and DR Primary Data Center Fileshare Witness Primary Disaster Recovery Data Center Windows Server Failover Cluster Secondary Secondary Synchronous Synchronous / Asynchronous Availability Group SQL Server 2012: FCI for HA, AG for DR Primary Site Windows Server Failover Cluster Node 3 Node 2 Node 1 DR Site Node 4 SQL-FCI-2 SQL-FCI-1 Secondary Primary Availability Group Lessons Learned from Customer Deployments • We will discuss two Tier-1 Mission-Critical deployments to relate lessons learned with specific scenarios • ServiceU • Microsoft IT SAP ERP Deployment Customer Deployment Example: ServiceU Corporation FCI + DBM (prior to SQL Server 2012) (FCIs for local high availability, DBM for Disaster Recovery) PRIMARY SECONDARY Database Mirroring Asynchronous Site 1 • Two independent Windows clusters – one at each site • Each site has a 3-node FCI • Using Disk-Only Quorum Site 2 FCI + AG (SQL Server 2012) (FCIs for local high availability, AG for Disaster Recovery) PRIMARY SECONDARY Availability Group Asynchronous Site 1 Site 2 • With SQL Server 2012 and AGs, it is a single Windows cluster spanning the two sites, instead of a different Windows cluster at each site. • NOTE: Asymmetric Storage requires Windows Server 2008 R2 + SP1 or QFE’s Lesson 1: Choose Appropriate Quorum Model • For mission-critical infrastructure, ServiceU has extra passive nodes at each site (3 node FCIs with only 1 active SQL Server instance) so that in the event of a hardware failure, still have extra node for local HA • ServiceU SLA demands that SQL Server be online, with no user intervention, as long as one node (out of 3) is up in the primary site => Last Man Standing • Node Majority or Node+Fileshare Majority can’t provide Last Man Standing • In FCI+AG configuration across two sites, the storage is asymmetric storage. Can asymmetric disk resource be used as quorum? 14 FCI + AG (SQL Server 2012) (FCIs for local high availability, AG for Disaster Recovery) PRIMARY SECONDARY Solution: Windows Server 2008 now supports Asymmetric Disk-Only Quorum Availability Group Asynchronous Site 1 Site 2 Asymmetric Disk-Only Quorum • Must be configured with cluster.exe; not supported in GUI • Requires testing and thorough knowledge of Windows clustering • Allows “Last Man Standing” Lesson 2: Understand Operational Procedures for DR Scenario: Disaster = entire primary site is unavailable Considerations • #1: Must force quorum on the DR site to get Windows cluster back online (any quorum model) • Impact: change operational procedures • /FQ switch seems to work faster than the GUI • #2: Rethink quorum model after disaster (any quorum model) • Re-establishing quorum is a Windows cluster admin job, not DBA job. • Until this is done you cannot bring SQL online. • Who needs to be involved? • #3: Changing quorum model is a multi-step process for ServiceU • Convert from disk-only to node majority • Then, convert from node majority to disk-only (with a separate new disk resource at the DR site) Ensure the process is documented and understood by all players. 16 Lesson 3: Instance Name & Path Plan the file paths and instance names before starting the first installation • Default installation paths on shared disk: • ..\MSSQL11.InstanceName\MSSQL\DATA • ..\MSSQL11.InstanceName\MSSQL\Log • Requirements for AG between two FCIs: • DIFFERENT instance names between FCIs • Recommendation for all AG deployments: • SAME file path across all FCIs • Solution: Be sure to use a path for Data and Logs that is always the same for all instances participating in an AG. File path recommendations is the same as in DBM, but this is more significant now because it is one Windows cluster, and the two FCI instance names can’t be 17 the same like they could in DBM between two FCIs. Lesson 4: Concurrent FCI setup on multiple nodes will set possible owners incorrectly Initial setup • Node1 and Node2 added during the initial install. • Both are possible owners of the FCI What happened during a concurrent install on Node3 and Node4: 1. Node3 setup started (possible owners=Node1 & Node2) 2. Node4 setup started a few seconds later (possible owners=Node1 & Node2) 3. Node3 setup finished: possible owners set to previous owners (Node1 & Node2) PLUS this node (Node3) 4. Node4 setup finished: possible owners set to previous owners (Node1 & Node2) PLUS this node (Node4) 5. End result: Possible owners=Node1, Node2, Node4 Correct result: Possible owners=Node1, Node2, Node3, Node4 Solution: DO NOT to use concurrent installation – it is unsupported and NOT recommended. 18 Lesson 5: After adding a node, possible owners = every node in the Windows Cluster Example: After the DR nodes were added to the Windows Cluster, the DR nodes show up as possible owners of the primary site FCI. Solution: Manually remove invalid owners from every resource (every time you 19 add a node). Not necessary on disks. Lesson 6: Failover Cluster Manager shows as if a FCI can be moved to a node that is not a possible owner Issue #3: Move operations allow selection of nodes that are not possible owners 20 Customer Deployment Example: MSIT SAP ERP Deployment DBM + LS (Prior to SQL Server 2012) Database Mirroring for local HA, Log Shipping for DR Production Witness Synchronous DBM Log Shipping Disaster Recovery Test SAP Volume Test and Integration System Image of production 22 Deployment Architecture: SQL Server 2012 Availability Group for HA and DR File share for Cluster Quorum Production Sync 1 1 1 Production Availability Group Log on Shipping production DBMS cluster 7+ 5+ Tb EMC CX3-80 SAN Sync SAP production CI cluster containing File Share quorum for DBMS cluster SAP test CI cluster containing File Share quorum for test DBMS cluster 0 Test Main Site DR Site 7+ Tb EMC CX3-80 SAN Test Availability Group on test DBMS cluster SAP Volume Test and Integration System Image of production 23 Lesson 1: Working across organizational boundaries • Working with the Windows Team in IT • Organization: Separate team from the DBA team • DBA team underestimated the role of Windows Cluster • Challenge: Concept of a Windows Cluster with “no” (?) shared storage Lesson 2: Choose Appropriate Quorum Model Choice of Quorum Model • What do we chose: Node Majority vs Node+Fileshare Majority • The Windows IT team didn’t know anything other than a shared storage quorum • Picked Node+Fileshare Majority • Node votes • Don’t want issues at DR data center or the connectivity between the data centers affect HA in the primary data center • Assigned zero vote to DR server • Lesson 3: Monitoring and Alerting AG • Alerting compared to DBM • Challenge: No DBM Monitor • Developed custom scripts for monitoring and alerting key measures • Alternative: System Center Operations Manager Lesson 4: DO NOT use Windows Failover Cluster Manager to perform AG failover • • DO NOT change the preferred owners and possible owners settings for the AG. DO NOT change the preferred owners and possible owners settings for the AG listener. • DO NOT move AG from one to another node using the FCM. • Solution: Use SSMS or T-SQL Questions ? [email protected] Thank You!

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download AlwaysOn Design Patterns and Lessons Learned.ppt