SQL Server 2012:
AlwaysOn HA and DR Design Patterns,
and Lessons Learned
from Early Customer Deployments
Sanjay Mishra
SQLCAT
Who and Why
• Who am I?
• Why am I speaking about this?
Setting the Stage
• AlwaysOn ≠ Availability Groups
• AlwaysOn = { SQL Server Failover Cluster Instances, Availability Groups }
• Availability Groups ≠ Database Mirroring
Key Learnings from Early Customer Deployments
• Windows Cluster (Windows Server Failover Cluster, WSFC) is the foundation for HA and DR in SQL Server 2012 AlwaysOn
• AlwaysOn inherits all “characteristics” of Windows Cluster
• Every single AlwaysOn deployment is a Windows Cluster deployment
• You must understand Windows Cluster if you want to deploy, operate, monitor, troubleshoot, and administer AlwaysOn
  • Key areas are: quorum model, cluster network communication, DR procedures, cluster.exe, PowerShell
• Windows Cluster ≠ SQL Cluster (SQL Server Failover Cluster Instance), and therefore is NOT necessarily a shared-storage cluster
• Many key enhancements have been made to Windows Cluster specifically for SQL Server 2012 AlwaysOn:
  • Asymmetric Disk
  • Node Votes
  • Asymmetric Disk as Quorum resource
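Since the quorum model and node votes are among the key areas called out above, a quick way to inspect both is the FailoverClusters PowerShell module. A minimal sketch (run on any cluster node):

```powershell
# Requires the FailoverClusters module (Windows Server 2008 R2 or later)
Import-Module FailoverClusters

# Show the current quorum model and quorum resource
Get-ClusterQuorum

# List the nodes, their state, and their votes
# (NodeWeight on Windows Server 2008 R2 requires the node-vote hotfix)
Get-ClusterNode | Select-Object Name, State, NodeWeight
```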
Key Learnings from Early Customer Deployments
• Organizational structure
  • Typically, teams and skills are organized into separate groups – a SQL Server DBA team and a Windows Server Admin team
  • AlwaysOn reaches out beyond the SQL Server DBA
  • DBAs need to work closely with the Windows / Network Administration teams
    • Not just for initial deployment, but for troubleshooting and disaster recovery as well
• Historical experience
  • You need to unlearn and relearn a few things if you are already experienced with Windows Cluster, but new to AlwaysOn
  • For example, if you haven’t read the Windows Cluster documentation in the last few months, it is worth a re-read now
• New/different tools for administration and troubleshooting
  • Windows cluster log
  • Failover Cluster Manager
  • PowerShell
  • cluster.exe
  • Knowledge of the PowerShell and cluster.exe command lines will come in very handy
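As an illustration of the tools listed above: the Windows cluster log is not a file you simply open — it is generated on demand from the cluster's event trace. A minimal sketch (the destination folder is a placeholder):

```powershell
Import-Module FailoverClusters

# Generate cluster.log on every node and copy them to one folder
Get-ClusterLog -Destination C:\Temp\ClusterLogs

# Narrow it down: only the last 60 minutes, local node only
Get-ClusterLog -Node $env:COMPUTERNAME -TimeSpan 60 -Destination C:\Temp\ClusterLogs
```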
SQL Server 2012 AlwaysOn HA+DR Design Patterns

1. Multi-site Failover Cluster Instance (FCI) for HA and DR
   Solution characteristics:
   • Shared Storage solution *
   • Instance Level HA (automatic)
   • Instance Level DR (automatic *)
   • Uses storage replication
   • Doesn’t require database to be in FULL recovery model
   Corresponding pre-Denali solution: Multi-site FCI using stretch VLAN

2. Availability Group for HA and DR
   Solution characteristics:
   • Non-Shared Storage solution
   • (Group of) Database Level HA (automatic)
   • (Group of) Database Level DR (manual)
   • DR replica can be Active Secondary
   • Requires database to be in FULL recovery model
   Corresponding pre-Denali solution: Database Mirroring for local HA and Log Shipping for DR

3. Failover Cluster Instance for local HA + Availability Group for DR
   Solution characteristics:
   • Combined Shared Storage and Non-Shared Storage
   • Instance Level HA (automatic)
   • (Group of) Database Level DR (manual)
   • DR replica can be Active Secondary
   • Requires database to be in FULL recovery model
   • Asymmetric storage is the key to this solution
   Corresponding pre-Denali solution: FCI for local HA and Database Mirroring for DR
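Because patterns 2 and 3 require the participating databases to be in the FULL recovery model, it is worth verifying that before configuring an Availability Group. A minimal sketch using Invoke-Sqlcmd (the instance name is a placeholder):

```powershell
# Requires the SQL Server PowerShell tools (Invoke-Sqlcmd)
Invoke-Sqlcmd -ServerInstance "SQLNODE1\INST1" -Query @"
SELECT name, recovery_model_desc
FROM sys.databases
WHERE database_id > 4;  -- skip the system databases
"@
```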
SQL Server 2012: Multi-site FCI for HA and DR
[Diagram: one Windows Server Failover Cluster spanning two sites. Primary Site: Node 1 (Active), Node 2 (Passive). DR Site: Node 3 (Passive), Node 4 (Passive). A single SQL-FCI, with storage replication between the sites.]
SQL Server 2012: AG for HA and DR
[Diagram: one Windows Server Failover Cluster spanning the Primary Data Center and the Disaster Recovery Data Center, with a Fileshare Witness. The Availability Group has a Primary and a Secondary (Synchronous) in the primary data center, and a Secondary (Synchronous / Asynchronous) in the DR data center.]
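A topology like the one above — one synchronous local secondary plus one asynchronous DR secondary — might be created roughly as follows. This is a hedged sketch: all server, database, and endpoint names are placeholders, and prerequisites (HADR enabled on each instance, mirroring endpoints created, database backed up) are assumed:

```powershell
# Run on the instance that will be the primary replica
Invoke-Sqlcmd -ServerInstance "SQLNODE1" -Query @"
CREATE AVAILABILITY GROUP [AG1]
FOR DATABASE [SalesDB]
REPLICA ON
  N'SQLNODE1' WITH (ENDPOINT_URL = N'TCP://sqlnode1.contoso.com:5022',
      AVAILABILITY_MODE = SYNCHRONOUS_COMMIT, FAILOVER_MODE = AUTOMATIC),
  N'SQLNODE2' WITH (ENDPOINT_URL = N'TCP://sqlnode2.contoso.com:5022',
      AVAILABILITY_MODE = SYNCHRONOUS_COMMIT, FAILOVER_MODE = AUTOMATIC),
  N'SQLNODE3' WITH (ENDPOINT_URL = N'TCP://sqlnode3.contoso.com:5022',
      AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT, FAILOVER_MODE = MANUAL);
"@
```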
SQL Server 2012: FCI for HA, AG for DR
[Diagram: one Windows Server Failover Cluster spanning both sites. Primary Site: Node 1, Node 2, Node 3 hosting SQL-FCI-1 (Primary). DR Site: Node 4 hosting SQL-FCI-2 (Secondary). An Availability Group spans the two FCIs.]
Lessons Learned from Customer Deployments
• We will discuss two Tier-1 Mission-Critical deployments to relate lessons learned with specific scenarios
  • ServiceU
  • Microsoft IT SAP ERP Deployment
Customer Deployment Example: ServiceU Corporation

FCI + DBM (prior to SQL Server 2012)
(FCIs for local high availability, DBM for Disaster Recovery)
[Diagram: PRIMARY at Site 1 and SECONDARY at Site 2, connected by asynchronous Database Mirroring.]
• Two independent Windows clusters – one at each site
• Each site has a 3-node FCI
• Using Disk-Only Quorum
FCI + AG (SQL Server 2012)
(FCIs for local high availability, AG for Disaster Recovery)
[Diagram: PRIMARY at Site 1 and SECONDARY at Site 2, connected by an asynchronous Availability Group.]
• With SQL Server 2012 and AGs, it is a single Windows cluster spanning the two sites, instead of a different Windows cluster at each site.
• NOTE: Asymmetric Storage requires Windows Server 2008 R2 + SP1 or QFEs
Lesson 1: Choose Appropriate Quorum Model
• For mission-critical infrastructure, ServiceU has extra passive nodes at each site (3-node FCIs with only 1 active SQL Server instance), so that in the event of a hardware failure they still have an extra node for local HA
• The ServiceU SLA demands that SQL Server be online, with no user intervention, as long as one node (out of 3) is up in the primary site => “Last Man Standing”
• Node Majority or Node+Fileshare Majority can’t provide Last Man Standing
• In the FCI+AG configuration across two sites, the storage is asymmetric. Can an asymmetric disk resource be used as quorum?
FCI + AG (SQL Server 2012)
(FCIs for local high availability, AG for Disaster Recovery)
Solution: Windows Server 2008 now supports Asymmetric Disk-Only Quorum
[Diagram: PRIMARY at Site 1 and SECONDARY at Site 2, connected by an asynchronous Availability Group, with an asymmetric disk-only quorum resource.]
Asymmetric Disk-Only Quorum
• Must be configured with cluster.exe; not supported in the GUI
• Requires testing and thorough knowledge of Windows clustering
• Allows “Last Man Standing”
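The slide notes this must be configured with cluster.exe rather than the GUI. As a rough illustration of what a disk-only quorum change looks like, here is the standard PowerShell equivalent — a sketch only: the disk resource name is a placeholder, and whether PowerShell (vs. cluster.exe) is supported for the asymmetric case should be verified against your Windows version:

```powershell
Import-Module FailoverClusters

# Switch the cluster to a disk-only quorum on the named disk resource.
# For this scenario, the resource would be the asymmetric disk.
Set-ClusterQuorum -DiskOnly "Cluster Disk 1"

# Verify the result
Get-ClusterQuorum
```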
Lesson 2: Understand Operational Procedures for DR
Scenario: Disaster = entire primary site is unavailable
Considerations
• #1: Must force quorum on the DR site to get the Windows cluster back online (any quorum model)
  • Impact: change operational procedures
  • The /FQ switch seems to work faster than the GUI
• #2: Rethink the quorum model after a disaster (any quorum model)
  • Re-establishing quorum is a Windows cluster admin job, not a DBA job.
  • Until this is done you cannot bring SQL online.
  • Who needs to be involved?
• #3: Changing the quorum model is a multi-step process for ServiceU
  • Convert from disk-only to node majority
  • Then, convert from node majority to disk-only (with a separate new disk resource at the DR site)
Ensure the process is documented and understood by all players.
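Forcing quorum on a surviving DR node (the /FQ switch mentioned above) can be sketched as follows. Run this only during a genuine DR event, and only on one node — treat it as an outline to validate in your own environment:

```powershell
# On ONE surviving node at the DR site:
# stop the cluster service, then restart it with forced quorum
net stop clussvc
net start clussvc /fq

# PowerShell alternative (FailoverClusters module)
Start-ClusterNode -Name $env:COMPUTERNAME -FixQuorum
```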
Lesson 3: Instance Name & Path
Plan the file paths and instance names before starting the first installation
• Default installation paths on the shared disk:
  • ..\MSSQL11.InstanceName\MSSQL\DATA
  • ..\MSSQL11.InstanceName\MSSQL\Log
• Requirement for an AG between two FCIs:
  • DIFFERENT instance names between the FCIs
• Recommendation for all AG deployments:
  • SAME file path across all FCIs
• Solution: Be sure to use a path for Data and Logs that is always the same for all instances participating in an AG.
The file path recommendation is the same as in DBM, but it is more significant now because it is one Windows cluster, and the two FCI instance names can’t be the same like they could in DBM between two FCIs.
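A quick way to confirm that data and log paths match across the instances in an AG is to compare sys.master_files on each replica. A minimal sketch (instance names are placeholders):

```powershell
# Compare file locations on both FCIs participating in the AG
foreach ($instance in "SQLFCI1\INST1", "SQLFCI2\INST2") {
    Invoke-Sqlcmd -ServerInstance $instance -Query @"
SELECT DB_NAME(database_id) AS db, physical_name
FROM sys.master_files
ORDER BY db, physical_name;
"@ | Format-Table -AutoSize
}
```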
Lesson 4: Concurrent FCI setup on multiple nodes will set possible owners incorrectly
Initial setup
• Node1 and Node2 added during the initial install.
• Both are possible owners of the FCI
What happened during a concurrent install on Node3 and Node4:
1. Node3 setup started (possible owners = Node1 & Node2)
2. Node4 setup started a few seconds later (possible owners = Node1 & Node2)
3. Node3 setup finished: possible owners set to previous owners (Node1 & Node2) PLUS this node (Node3)
4. Node4 setup finished: possible owners set to previous owners (Node1 & Node2) PLUS this node (Node4)
5. End result: possible owners = Node1, Node2, Node4
Correct result: possible owners = Node1, Node2, Node3, Node4
Solution: DO NOT use concurrent installation – it is unsupported and NOT recommended.
Lesson 5: After adding a node, possible owners = every node in the Windows Cluster
Example: After the DR nodes were added to the Windows Cluster, the DR nodes show up as possible owners of the primary site FCI.
Solution: Manually remove invalid owners from every resource (every time you add a node). Not necessary on disks.
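Checking and correcting possible owners after adding a node can be scripted. A sketch — resource and node names are placeholders, so verify the resource list before applying this to a production cluster:

```powershell
Import-Module FailoverClusters

# Inspect the possible owners of the FCI's SQL Server resource
Get-ClusterResource "SQL Server (INST1)" | Get-ClusterOwnerNode

# Restrict possible owners back to the primary-site nodes only
Get-ClusterResource "SQL Server (INST1)" |
    Set-ClusterOwnerNode -Owners Node1, Node2, Node3
```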
Lesson 6: Failover Cluster Manager shows as if an FCI can be moved to a node that is not a possible owner
• Issue: Move operations allow selection of nodes that are not possible owners
Customer Deployment Example: MSIT SAP ERP Deployment

DBM + LS (prior to SQL Server 2012)
(Database Mirroring for local HA, Log Shipping for DR)
[Diagram: Production with a Witness and synchronous DBM at the main site; Log Shipping to the Disaster Recovery site; plus a separate Test environment – the SAP Volume Test and Integration System, an image of production.]
Deployment Architecture: SQL Server 2012
Availability Group for HA and DR
[Diagram: Main Site and DR Site. The Production Availability Group runs on the production DBMS cluster, with synchronous replicas (node vote 1 each) at the main site, a zero-vote replica at the DR site, and log shipping to the test environment; a file share provides the cluster quorum. The SAP production CI cluster hosts the file share quorum for the DBMS cluster, and the SAP test CI cluster hosts the file share quorum for the test DBMS cluster. Storage: 5+ TB and 7+ TB EMC CX3-80 SANs. The Test Availability Group runs on the test DBMS cluster for the SAP Volume Test and Integration System, an image of production.]
Lesson 1: Working across organizational boundaries
• Working with the Windows Team in IT
  • Organization: a separate team from the DBA team
  • The DBA team underestimated the role of Windows Cluster
  • Challenge: the concept of a Windows Cluster with “no” (?) shared storage
Lesson 2: Choose Appropriate Quorum Model
Choice of Quorum Model
• What do we choose: Node Majority vs Node+Fileshare Majority?
  • The Windows IT team didn’t know anything other than a shared storage quorum
  • Picked Node+Fileshare Majority
• Node votes
  • Don’t want issues at the DR data center, or the connectivity between the data centers, affecting HA in the primary data center
  • Assigned zero vote to the DR server
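Assigning zero votes to the DR node, as described above, can be sketched as follows (on Windows Server 2008 R2 this requires the node-vote hotfix; the node name is a placeholder):

```powershell
Import-Module FailoverClusters

# Remove the DR node's vote so DR-site or WAN issues
# cannot affect quorum in the primary data center
(Get-ClusterNode -Name "DRNode1").NodeWeight = 0

# Confirm the votes
Get-ClusterNode | Select-Object Name, NodeWeight
```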
Lesson 3: Monitoring and Alerting an AG
• Alerting compared to DBM
  • Challenge: no equivalent of the DBM Monitor
  • Developed custom scripts for monitoring and alerting on key measures
  • Alternative: System Center Operations Manager
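Custom monitoring scripts like the ones mentioned typically poll the AlwaysOn DMVs. A minimal sketch (the instance name is a placeholder) that checks each replica's role and synchronization health:

```powershell
Invoke-Sqlcmd -ServerInstance "SQLNODE1" -Query @"
SELECT ag.name AS ag_name,
       ar.replica_server_name,
       rs.role_desc,
       rs.synchronization_health_desc
FROM sys.dm_hadr_availability_replica_states rs
JOIN sys.availability_replicas ar ON rs.replica_id = ar.replica_id
JOIN sys.availability_groups ag   ON rs.group_id = ag.group_id;
"@
```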
Lesson 4: DO NOT use Windows Failover Cluster Manager to perform AG failover
• DO NOT change the preferred owners and possible owners settings for the AG.
• DO NOT change the preferred owners and possible owners settings for the AG listener.
• DO NOT move the AG from one node to another using the FCM.
• Solution: Use SSMS or T-SQL
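The supported T-SQL path for a failover is run on the target secondary replica; a sketch (AG and instance names are placeholders):

```powershell
# Run against the SECONDARY replica that should become primary

# Planned failover (synchronized secondary, no data loss)
Invoke-Sqlcmd -ServerInstance "SQLNODE2" -Query "ALTER AVAILABILITY GROUP [AG1] FAILOVER;"

# DR failover to an asynchronous secondary (possible data loss)
Invoke-Sqlcmd -ServerInstance "SQLNODE3" -Query "ALTER AVAILABILITY GROUP [AG1] FORCE_FAILOVER_ALLOW_DATA_LOSS;"
```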
Questions?
[email protected]
Thank You!