Pro SQL Server
Always On Availability Groups
Enterprise-level high-availability and disaster recovery
Uttam Parui, Mooresville, North Carolina, USA
Vivek Sanil, San Antonio, Texas, USA
ISBN-13 (pbk): 978-1-4842-2070-2 ISBN-13 (electronic): 978-1-4842-2071-9
DOI 10.1007/978-1-4842-2071-9
Library of Congress Control Number: 2016955944
Copyright © 2016 by Uttam Parui and Vivek Sanil
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now
known or hereafter developed.
Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with
every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an
editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are
not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to
proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of publication,
neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or
omissions that may be made. The publisher makes no warranty, express or implied, with respect to the
material contained herein.
Managing Director: Welmoed Spahr
Lead Editor: Jonathan Gennick
Development Editor: Corbin Collins
Technical Reviewer: Rahul Deshmukh
Editorial Board: Steve Anglin, Pramila Balan, Laura Berendson, Aaron Black, Louise Corrigan,
Jonathan Gennick, Todd Green, Robert Hutchinson, Celestin Suresh John, Nikhil Karkal,
James Markham, Susan McDermott, Matthew Moodie, Natalie Pao, Gwenan Spearing
Coordinating Editor: Jill Balzano
Copy Editor: Karen Jameson
Compositor: SPi Global
Indexer: SPi Global
Artist: SPi Global
Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street,
6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail orders-ny@springer-sbm.com,
or visit www.springer.com. Apress Media, LLC is a California LLC and the sole member (owner)
is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware
corporation.
For information on translations, please e-mail [email protected], or visit www.apress.com.
Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use.
eBook versions and licenses are also available for most titles. For more information, reference our Special
Bulk Sales–eBook Licensing web page at www.apress.com/bulk-sales.
Any source code or other supplementary materials referenced by the author in this text are available to
readers at www.apress.com. For detailed information about how to locate your book’s source code, go to
www.apress.com/source-code/. Readers can also access source code at SpringerLink in the Supplementary
Material section for each chapter.
Printed on acid-free paper
Contents
Summary .......................................................................................................................... 8
■Chapter 2: Introduction to Always On.................................................................... 9
Common Solutions before Always On .............................................................................. 9
Failover Cluster Instances (FCI) .............................................................................................................. 9
Database Mirroring ............................................................................................................................... 10
Log Shipping ......................................................................................................................................... 11
Replication ............................................................................................................................................ 12
Limitations and Concerns ..................................................................................................................... 14
Summary ........................................................................................................................ 16
Summary ........................................................................................................................ 33
■Chapter 4: Data Synchronization Internals.......................................................... 35
Trade-offs and Opportunity Cost .................................................................................... 35
Synchronous-Commit Mode ........................................................................................... 35
Asynchronous-Commit Mode ......................................................................................... 37
Synchronization Behavior in Various Scenarios ............................................................. 39
Scenario 1: Secondary Replica Goes Offline......................................................................................... 39
Scenario 2: Primary Replica Goes Offline ............................................................................................. 41
Summary ........................................................................................................................ 44
Summary ........................................................................................................................ 92
■Chapter 8: Create Availability Groups.................................................................. 93
Step 1: Enable the Always On Availability Groups Feature ............................................. 93
Step 2: Invoke Availability Group Wizard ........................................................................ 97
Step 3: Select Availability Group Name .......................................................................... 98
Step 4: Select Databases ............................................................................................. 100
Step 5: Specify Replicas ............................................................................................... 102
Step 6: Configure Endpoint ........................................................................................... 106
Step 7: Configure Backup Preferences......................................................................... 106
Step 8: Configure Listener ............................................................................................ 108
Step 9: Select Initial Data Synchronization .................................................................. 112
Step 10: Validate........................................................................................................... 114
Other Ways to Create an Availability Group .................................................................. 117
Availability group dialog box............................................................................................................... 117
T-SQL .................................................................................................................................................. 118
PowerShell.......................................................................................................................................... 124
About the Authors
Vivek Sanil is a Senior Premier Field Engineer at Microsoft. Vivek has been
working extensively with SQL Server since joining the industry more than 16
years ago. He has been with Microsoft since 2005. Vivek currently works in a
support and consulting role in a dedicated capacity with a few large Fortune
500 customers. Areas of specialty include Database Engine internals,
SQL Always On, and performance troubleshooting. He has architected and
developed numerous PFE workshops that Microsoft delivers to its
customers. Vivek has also moderated several worldwide Microsoft Virtual
Academy events. He can be reached at [email protected].
About the Technical Reviewer
Acknowledgments
PART I
Getting Started
CHAPTER 1
High Availability and Disaster Recovery Concepts
In a perfect world, data would always remain available no matter what. But in the real world, we know that
there can be multiple problems that can cause data to become unavailable.
For example, hospitals depend on patient data to be available all the time; otherwise patient care can be
affected. Let’s say that a surgeon is in the operating room and needs to refer to some of the patient’s critical
information. If the surgeon is unable to access the patient data in a reasonable amount of time, it could
result in the loss of the patient’s life. Another example is a 24/7 online retailer such as Amazon.com whose
customer-facing website must be up and running at all times. If the customer cannot see the merchandise
on the website or cannot place an order, the website is considered unavailable from the customer’s point of
view and may cause a loss of sale and revenue.
It is very important to have people with the right skill set, roles, responsibilities, and leadership in place.
Processes play a very important role, too. Processes should be documented, followed, have clear ownership,
and be practiced and updated as the business application and workload change over time. Technology
needs to be able to meet the business goals for achieving high availability, be easy to deploy, and be
supported and maintained.
Calculating Availability
Availability is usually expressed as a percentage of uptime in a given year and is calculated as the following:
Availability = (Actual uptime / Expected uptime) × 100%
The resulting value is expressed by the number of 9's that the solution provides. For example, let's say
that the total downtime in a given year (365 days × 24 hours = 8760 hours) for a particular highly available
SQL Server application is one hour. Then, let's calculate the following:

Availability = (8760 − 1) / 8760 × 100% ≈ 99.9%, or three nines
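To make the arithmetic concrete, here is a minimal T-SQL sketch of the same calculation (the one-hour downtime value is illustrative):

DECLARE @expected_uptime_hours decimal(10, 2) = 365 * 24; -- 8760 hours in a year
DECLARE @downtime_hours decimal(10, 2) = 1;               -- illustrative downtime
-- Availability = (Actual uptime / Expected uptime) x 100%
SELECT CAST((@expected_uptime_hours - @downtime_hours)
            / @expected_uptime_hours * 100 AS decimal(6, 3)) AS availability_pct; -- returns 99.989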
Table 1-1 shows the translation from a given availability percentage to the corresponding time a system
would be unavailable.
■ Note As the requirement for the number of 9’s or uptime increases, the cost of the system increases, and
vice versa. In order to achieve a maximum annual downtime of 5 minutes, 15 seconds (i.e., 5 nines) or 31.5
seconds (i.e., 6 nines), you need to eliminate all single points of failures and allow online replacement of all
production-related resources.
We have seen customers interpreting availability in many different ways. For example, for some
applications, scheduled downtime may not affect the end users. For such applications, customers exclude
the scheduled downtime from the availability calculations. By doing this, they claim to have very high
availability that may seem to give an impression of continuous availability, which in reality is rare and very
costly.
Another example is a scenario where the SQL Server may be up and running for all the days in a
year but the end users could not access the data for 5 hours due to a network outage. In this scenario, the
SQL Server is up but the services are not available for 5 hours due to the network outage. The database
administrators (DBAs) may claim 100% uptime. However, using the above formula for calculating
availability, the application will be 99.9% available, or 3 nines.
Causes of Downtime
Every minute of downtime affects the business either directly (such as loss in sales, revenue, decreased
productivity) or indirectly (loss of goodwill). Figure 1-2 shows some common causes of downtime.
Many customers use high availability and disaster recovery interchangeably because it is possible to
have a solution that provides both high availability and disaster recovery. However, it is not necessary that if
a solution provides high availability, it will also provide disaster recovery and vice versa. In fact, it is common
to see most of our customers have some form of high availability. However, disaster recovery is mostly on
their to-do project list, and it is often never implemented or paid attention to until a disaster actually
occurs.
Recovery Objectives
Before implementing a disaster recovery plan, it is important to understand what costs a disaster can
incur for the business. For example, say your datacenter is destroyed by a hurricane. How long will it take
to recover and return to normal operations? How much data loss can your business accept? To design a
disaster recovery plan, you have three main requirements for the business unit:
• Recovery Time Objective (RTO). It is the maximum allowable downtime when a
failure occurs. In other words, it is the maximum amount of time the business can
accept for restoring normal operations after a failure. After a failure occurs, your
primary goal is to restore normal operations to the point that business can continue
normally.
• Recovery Point Objective (RPO). It is the maximum acceptable level of data loss
after a failure occurs. It represents the point in time to which the lost data can be
recovered.
• Recovery Level Objective (RLO). It defines the granularity with which the data needs to
be recovered after a failure occurs. For SQL Server, it defines whether you must be
able to recover the whole instance, a database, a group of databases, or individual tables.
RTO, RPO, and RLO requirements must be documented in the service-level agreements (SLAs) and
should be reviewed periodically and changed as your environment changes. For example, let’s say your
database size increases from 250 GB to 2.5 TB in a year. This will significantly change the RTO as now you
need to work with a database that is 10 times larger as compared to when the original RTO was documented.
Most DBAs are not aware of RTO, RPO, or RLO requirements. Without knowing these requirements,
DBAs risk a huge mismatch between what they are able to recover after a failure occurs and what a business
needs to effectively survive a failure. Therefore, it is very important to understand the RTO, RPO, and RLO
requirements before a disaster occurs and proactively design a disaster recovery plan and test it periodically.
■ Note To ensure that the DR plans are up to date; complete; and can meet your RPO, RTO, and RLO
requirements, they should be periodically tested and updated (at least once every six months). DR plans should
be practiced by the same people who will be involved in the recovery process should a disaster occur.
Summary
High availability and disaster recovery solutions differ from each other but are often erroneously
interchanged. While an HA solution is about preventing a failure from affecting the availability of data, a DR
solution is about recovering from a failure and resuming high availability of data. Most customers spend their
time and resources on prevention and have some form of HA solution, but only 50% of them have a DR plan.
Out of the 50%, only half have ever tested their DR plans, which is equivalent to not having one at all. While
prevention is important, it is equally important to have a solid DR plan that you can follow to recover with
minimal loss of data, productivity, and revenue should a real disaster occur.
In the next chapter, we will discuss the common SQL Server solutions that are available to build end-to-
end HA and DR solutions for your mission-critical applications.
CHAPTER 2
Introduction to Always On
In today's fast-evolving business and commerce landscape, businesses need to be up all the time. Downtime
equates to revenue loss, and businesses are looking for ways to increase availability and improve their
disaster recovery capability. The IT professionals architecting and supporting such solutions are increasingly
feeling the heat from businesses to implement cutting-edge technologies to achieve their uptime goals.
A recent incident in which the New York Stock Exchange (NYSE) was shut down caused huge trading
losses, as trades could not be made for almost four hours. This incident highlights the importance of high
availability and the impact of downtime on businesses. Numerous such incidents happen every day, and IT
professionals have to use all the options they have in their arsenal to mitigate such risks. To cater to this
need, Microsoft introduced a new high availability and disaster recovery solution called Always On in SQL
Server 2012. This feature was further enhanced in SQL Server 2014 and then in SQL Server 2016.
FCI can also be stretched across multiple datacenters residing in the same or different subnets. On the
network, an FCI appears to be an instance of SQL Server running on a single computer. The clients
connect to a single virtual network name (VNN) for the FCI. When the FCI fails over from one WSFC node to
another, the clients use the same VNN to connect to the SQL Server instance even after a failover. SQL Server
data for the instance is stored in the shared storage. On failover, the other node connects to the shared
storage and presents the data back to the user. There is no data duplication/redundancy within the failover
cluster configuration.
Figure 2-1 shows a typical SQL failover cluster setup. The client PCs connect over the public network,
and the cluster nodes communicate with each other over the private network (optional). You do not need to
have two separate networks. Both public connectivity and intracluster communication can occur over one
network, provided there is redundancy for the network.
■ Note FCI is now a part of the SQL Server Always On offering and has been renamed/rebranded as Always On
Failover Cluster Instance.
Database Mirroring
FCI provides instance- and node-level failover. Database mirroring is a solution to increase database
availability by supporting almost instantaneous failover at the database level. As the name suggests, a
mirrored database is an exact copy of the database that has been mirrored. Objects outside the database
are not copied to the mirror server as part of the database mirroring synchronization process. Database
mirroring can be used to maintain a single standby database, or mirror database, for a corresponding
production database that is referred to as the principal database. For automatic failover capability, a
witness server can be added to the configuration along with synchronous mirroring. Both synchronous and
asynchronous mirroring can be set up within the same datacenter or across datacenters.
Figure 2-2 shows a typical database mirroring setup. Asynchronous mirroring is typically set up
between two SQL Server instances in separate datacenters for disaster recovery purposes. And synchronous
mirroring is typically set up between two SQL Server instances in the same datacenter for high availability.
Witness provides the automatic failover capability.
■ Note Database mirroring feature is deprecated and will be removed in a future version of SQL Server.
Avoid using this feature in new development work, and plan to modify applications that currently use this
feature. It is recommended that you use Always On Availability Groups instead.
Log Shipping
Similar to database mirroring, log shipping operates at the database level. You can use log shipping to
maintain one or more warm standby databases (referred to as secondary databases) for a single production
database that is referred to as the primary database.
Imagine you are driving a car and suddenly you realize that you have a flat tire. You are in luck — you
have a spare tire in the trunk. You pull over and replace the flat tire with the spare one. Log shipping standby
databases are like spare tires. If the primary database becomes unavailable for some reason, you can failover
to your secondary database. If the secondary database needs to be kept up to date using transaction log
backups, then the following things need to happen:
• Transaction log backups need to happen periodically on the primary database
(the recovery model needs to be set to full recovery model on the primary database for
transaction log backups to be allowed).
• The transaction log backups need to be copied over to the secondary.
• The transaction log backups need to be restored on to the secondary database.
In log shipping, the backup, copy, and restore activities are managed through SQL Server Agent jobs that
are created when log shipping is configured. In newer cars, sensors notify the driver if the tire air
pressure drops below a certain limit. Similarly, in log shipping, a monitor server does the job of the sensors
by monitoring the health of the backup, copy, and restore jobs. If the jobs are not healthy, alerts can be
raised to notify administrators.
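Under the covers, the agent jobs run ordinary BACKUP LOG and RESTORE LOG commands. The following is a minimal manual sketch of one backup/copy/restore cycle; the database name and file paths are illustrative:

-- On the primary server: back up the transaction log
BACKUP LOG [SalesDB]
TO DISK = N'\\FileShare\LogShip\SalesDB_0001.trn';

-- (The copy job moves the .trn file to a share reachable by the secondary server.)

-- On the secondary server: restore the log, keeping the database ready for more restores
RESTORE LOG [SalesDB]
FROM DISK = N'\\FileShare\LogShip\SalesDB_0001.trn'
WITH NORECOVERY;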
Figure 2-3 shows a typical log shipping setup. Log shipping is typically set up between multiple SQL
instances in separate datacenters for disaster recovery purposes.
Replication
Replication is like a magazine subscription. People subscribe to magazines (articles) and get the latest copy
weekly, biweekly, or monthly (depending on the frequency that they choose). The magazines are distributed
by agents. Similarly, in SQL Server, the objects being replicated are called articles, and articles make up a
publication. The SQL Server instance hosting the publication is called the publisher. SQL Server or non-SQL
Server targets requiring data updates from the publisher and the publication are called the subscribers.
The multiple agents (snapshot, log reader, distribution agents) within SQL Server do the job of replicating
the data from the publisher to the subscriber. The magazine example is just to provide a simple overview of
replication.
There is a lot more to replication. There are different types of replication that implement different
methods of moving data from the publisher to the subscriber and in some cases getting the changes from the
subscriber back to the publisher. In replication, you do not have to necessarily replicate the entire database.
You can choose the tables/objects that you want to replicate from within a database. When replication was
first introduced, it was with the intention to provide a read/write copy for reporting purposes, where the
reporting applications would connect to the copy of the production data on the subscriber and not have to
go against the production database itself. Replication has evolved over the years and new types have been
added for scalability and high availability. Since it’s an additional copy of the data, it can be leveraged for
high availability/disaster recovery in certain scenarios.
• Snapshot replication
Snapshot replication is the most basic type of replication. As the name suggests, it
distributes data exactly as it appears at a specific point in time (snapshot). Changes
to data are not monitored; hence when data synchronization happens, a snapshot of
the entire publication is generated and sent to subscribers. Snapshot replication is
ideal if the data changes infrequently and if you are replicating small volumes of data
or if a large volume of changes occurs over a short period of time.
• Transactional replication
This is the most common type of replication used by our customers. Transactional
replication typically starts with a snapshot of the publication database objects
and data. Subsequent data changes at the publisher are monitored and delivered
to the subscriber as they happen. Basically, incremental changes are propagated
to subscribers as they occur. This replication type is ideal, where the application
requires low latency between the time that changes are made at the publisher and
when the changes arrive at the subscriber. This replication type is also useful in
the case where the publisher has a very high volume of insert, update, and delete
activity.
• Merge replication
Just like transactional replication, merge replication typically starts with a snapshot
of the publication database objects and data. However, in this case, the data changes
to both the publisher and the subscriber are monitored with the help of triggers. The
changes from the publisher are delivered to the subscriber and vice versa. Conflicts
can occur; however, merge replication has the ability to detect and resolve them. It allows multiple
subscribers to update the same data at various times and propagate those changes to
the publisher and to other subscribers.
• Peer-to-Peer replication
Out of all the different replication types, Peer-to-Peer (P2P) replication is the
most likely to be used for high availability purposes. P2P replication is built on
transactional replication. It provides a scale-out and high availability solution by
maintaining copies of data across multiple server instances, also referred to as
nodes. As data is maintained across the nodes in near real time, P2P replication
provides data redundancy, which increases the availability of data.
Figure 2-4 shows a typical peer-to-peer replication setup. Read/write data is replicated across multiple
server instances in geographically separate locations.
During the holiday season, I was shopping around for a TV. My simple expectation was a high definition
display and amazing surround sound. Unfortunately, I could not find one TV that had both. Every TV had
either great display or great sound. So I settled for a cheaper TV and also got a sound bar. The combination
of TV and sound bar was my solution for the TV viewing experience that I was looking for. Similarly, many
times one high availability/disaster recovery solution may not meet your business needs. In such situations,
you can use more than one solution.
For example, a very common high availability and disaster recovery solution that organizations use is to
use Failover Cluster Instances (FCIs) to protect SQL Server instances within each datacenter, combined with
asynchronous database mirroring, to provide disaster recovery capabilities for mission-critical databases.
Another common high availability and disaster recovery solution that organizations incorporate (when they
need more than one secondary datacenter or if they do not have shared storage) is database mirroring with a
witness within the primary datacenter combined with log shipping to move data to multiple locations.
You can consider Always On Availability Groups as a greatly enhanced version of database mirroring.
Availability groups address most of the limitations that database mirroring had and enhance the high
availability and disaster recovery capability. They do this by providing an integrated set of options, including
automatic and manual failover of a group of databases, multiple secondaries, active secondaries, fast
application failover, and automatic page repair.
Always On Failover Cluster Instances enhance the existing SQL Server failover clustering and support
multi-subnet clustering, which helps enable cross-datacenter failover of SQL Server instances. Fast and
predictable instance failover is another key benefit, which helps ensure fast application recovery. Table 2-1
below provides a quick summary of the differences between the two Always On solutions.
Always On Availability Groups have been enhanced quite a bit since they were first introduced in
SQL Server 2012. In SQL Server 2012 it supported up to five availability replicas per availability group (one
primary and up to four secondary replicas). We will be discussing the concepts and the specifics in detail in
later chapters. Let’s take a look at some of the enhancements in the various versions over the years.
Summary
As you can see, SQL Server offers a gamut of HA/DR solutions. Understanding their benefits and limitations
will equip you with making the right decision when it comes to designing your enterprise SQL Server
environment for HA/DR. Of these solutions, we will focus on availability groups for the remainder of this
book. Availability groups offer both HA and DR and so much more. A lot of improvements were made in SQL
Server 2016. Don’t worry if you are not familiar with some of the terms used in this chapter; it’s okay. This
chapter is just to provide an overview on Always On. We will be looking at the terms, terminology, concepts,
and implementation specifics in detail in the upcoming chapters. In the next chapter, we will take a look at
the availability group concepts and common topologies.
PART II

CHAPTER 3
Concepts and Common Topologies
Always On Availability Groups (AGs) provide a best-in-class, cost-effective high availability (HA)
and disaster recovery (DR) solution for on-premises and cloud applications. Availability group solutions
are significantly improved as compared to previous SQL Server solutions like database mirroring and log
shipping. They provide an integrated set of options, including automatic and manual failover of a group of
databases, multiple secondaries, active secondaries, fast application failover, and automatic page repair.
As we are building the foundation for Always On Availability Group solutions, in this chapter we want to
define the terms that we will be using many times in the next chapters.
■ Note There is no enforced limit on the maximum number of availability groups (AGs) or availability
databases. Microsoft has extensively tested with 10 AGs and 100 databases per physical replica. The maximum
number depends on your specific environment (hardware, workload, etc.), and you need to thoroughly test the
solution in a test environment before deploying it in the production environment.
For a given availability group, the availability replicas must be hosted by separate instances of SQL
Server running on different nodes of the same Windows Server Failover Cluster (WSFC). Availability
groups rely on Windows Server Failover Cluster for health monitoring, failover coordination, and server
connectivity. Each SQL Server instance can be used for many availability groups, but each availability group
can host only one copy of a user database on an SQL Server instance.
■ Note Even though availability groups require a Windows Server Failover Cluster, they do not require the SQL
Server instances to be clustered, nor do they require shared disks. An SQL Server instance can be stand-alone or
clustered. Most availability group solutions use stand-alone SQL Server instances and local hard disks.
An availability group is a WSFC role, previously referred to as a resource group. It defines a set of two or
more failover partners known as availability replicas. There are two types of availability replicas:
• Primary replica
• Secondary replica
The primary replica is the replica that hosts the primary databases and makes them available for read-write
connections to clients. A secondary replica is a replica that maintains a secondary copy of each primary
database and serves as a potential failover target for the availability group.
■ Note An availability group supports one primary replica and up to eight secondary replicas.
The secondary replica is an active replica as it supports read-only access to the secondary databases
for your reporting workload. Also, it allows performing backups and maintenance tasks such as database
integrity checks on the secondary database without impacting the production workload on the primary
replica. Figure 3-2 shows the availability replicas.
Each of the readable secondary replicas can be configured using one of the following three options:
• Yes – The secondary replica allows all connections with read-only workload.
• No – The secondary replica denies all connections. This is the default option.
• Read-intent only – The secondary replica allows connections to be made from
applications that explicitly specify the ‘ApplicationIntent=ReadOnly’ connection
string option.
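For example, a reporting application that should be routed to a readable secondary might use a connection string like the following (the server name AGListener and the database name are illustrative):

Server=tcp:AGListener,1433;Database=SalesDB;Integrated Security=SSPI;
ApplicationIntent=ReadOnly;MultiSubnetFailover=True;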
■ Note An availability group listener is dedicated to a single availability group. Different availability groups
cannot share the same listener.
An availability group listener consists of a Domain Name System (DNS) listener name, a listener port
number, and one or more IP addresses. Availability group listeners rely on the Windows Server Failover
Cluster (WSFC) to redirect the client connections in the event of availability group failures. When you create
an availability group listener, it becomes a cluster resource with an associated VNN, VIP address(es), and
availability group dependency.
When a primary replica goes offline and a secondary replica takes the role of the primary replica, the
availability group listener enables the clients to connect to the new primary replica. Also if read-only routing
is configured on the readable secondary replicas, read-intent client connections to the primary replica are
automatically redirected to the readable secondary replica. We will discuss read-only routing and readable
secondary replicas in detail in the chapter about readable secondary.
■ Note Although an availability group listener allows failover redirection and read-only routing, client
connections are not required to use them. They can directly connect to the SQL Server instance without using
the listener.
Availability Mode
The primary replica also sends the transaction log blocks for each primary database to every secondary
database. The availability mode of each availability replica determines whether the primary replica waits to
commit the transactions on the database until the secondary replica has written the transaction log records
to disk. Availability groups have two availability modes:
• Synchronous-commit mode
• Asynchronous-commit mode
In synchronous-commit mode, the primary replica waits to send the transaction confirmation to the
clients until the secondary replica writes the transaction log records to disk. This mode allows zero data loss
at the cost of increased transaction latency. This mode is recommended when business requirements are to
have zero data loss and high availability.
In asynchronous-commit mode, the primary replica does not wait for the secondary replica to write the
transaction log records to disk. The primary replica sends the transaction confirmation to the clients as soon
as the transaction log blocks are persisted on the primary database. In this mode, the transaction latency is
reduced as compared to synchronous-commit mode, but since the transaction log blocks may arrive later,
it does not guarantee zero data loss. Asynchronous-commit mode is recommended for disaster recovery
solutions where the availability replicas are separated by considerable distances. We will discuss the data
synchronization workings in detail in the chapter on data synchronization internals.
■ Note Availability groups support one primary replica and up to eight secondary replicas. All the replicas
can run under asynchronous-commit mode, or up to three of them can run under synchronous-commit mode.
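The availability mode is a per-replica setting and can be changed online. A minimal T-SQL sketch, assuming an availability group named AG1 and a replica named SQLNODE3 (both names are illustrative):

ALTER AVAILABILITY GROUP [AG1]
MODIFY REPLICA ON N'SQLNODE3'
WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT); -- or SYNCHRONOUS_COMMIT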
■ Note Starting with SQL Server 2016, automatic failover replicas have been increased from two replicas to
three replicas. This new enhancement increases the resiliency and scale of the availability groups solution.
Automatic failover is supported only when the following conditions are satisfied:
1. Both the primary and secondary replicas are running under synchronous-
commit mode with the failover mode set to automatic.
2. Each secondary database is synchronized with its corresponding primary
database.
3. Windows Server Failover Clustering (WSFC) has quorum.
4. A failure condition defined by the flexible failover policy of the availability
group is met.
Do not worry if you are not familiar with WSFC quorum or flexible failover policy – we will be discussing
them in detail in Chapters 5 and 9, respectively. Table 3-1 shows the conditions required for automatic
failover (without data loss).
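The failover mode itself is also a per-replica setting. A minimal sketch, again with illustrative names (a replica must already be in synchronous-commit mode before its failover mode can be set to automatic):

ALTER AVAILABILITY GROUP [AG1]
MODIFY REPLICA ON N'SQLNODE2'
WITH (FAILOVER_MODE = AUTOMATIC); -- requires AVAILABILITY_MODE = SYNCHRONOUS_COMMIT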
Availability group health monitoring in SQL Server 2012 and 2014 only monitored the health of the
primary SQL Server instance. This means that if the primary SQL Server instance is online and healthy but
an availability database goes offline or becomes corrupt, automatic failover will not be triggered.
■ Note Starting with SQL Server 2016, an availability group allows us to optionally configure the health
monitoring to also consider the health of the availability databases. This means that now an automatic failover
can be triggered when an availability database goes offline or becomes corrupt.
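This database-level health option is exposed as the DB_FAILOVER setting of the availability group. A minimal sketch with an illustrative group name:

ALTER AVAILABILITY GROUP [AG1]
SET (DB_FAILOVER = ON); -- SQL Server 2016 and later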
Planned manual failover (without data loss) occurs when an administrator manually initiates the
failover to cause a synchronized secondary replica to take over the primary replica role with guaranteed data
protection. Manual failover is supported only when the following conditions are satisfied:
1. Both the primary and secondary replicas are running under synchronous-
commit mode.
2. Each secondary database is synchronized with its corresponding primary
database.
3. Windows Server Failover Clustering (WSFC) has a quorum.
Table 3-2 shows the conditions required for planned manual failover (without data loss).
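A planned manual failover is initiated on the synchronized secondary replica that is to become the new primary. A minimal sketch with an illustrative group name:

-- Run on the target secondary replica
ALTER AVAILABILITY GROUP [AG1] FAILOVER;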
Forced manual failover (with possible data loss) is the only type of failover that an administrator can
initiate when the primary replica is lost and the secondary replica is not failover ready or no secondary
replica is synchronized. This is the only form of failover supported by asynchronous-commit replicas
because they are never synchronized. Forcing failover requires that the WSFC cluster has quorum. Forced
manual failover is also referred to as forced failover. Table 3-3 shows the conditions required for forced
manual failover (with possible data loss).
■ Note If you issue a forced failover on a synchronized secondary replica, all secondary databases are
suspended. You need to manually resume each suspended database individually on all secondary replicas.
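A forced failover is likewise issued on the target secondary replica, after which each suspended secondary database must be resumed. A minimal sketch with illustrative names:

-- Run on the secondary replica that is to become the new primary
ALTER AVAILABILITY GROUP [AG1] FORCE_FAILOVER_ALLOW_DATA_LOSS;

-- Then, on each secondary replica, resume every suspended database
ALTER DATABASE [SalesDB] SET HADR RESUME;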
Forced manual failover is a disaster recovery option that allows an administrator to use a secondary
replica as a warm standby replica after a failure occurs. As there is a potential for data loss, the forced manual
failover option should be used with caution. The amount of data loss depends on whether any transaction
log blocks were not sent to the secondary replica before the failure. Table 3-4 shows the possibility of data
loss for a particular database on the replica after a forced manual failover.
Availability Mode of Secondary Replica    Database Synchronization State    Is Data Loss Possible?
Synchronous-commit                        Synchronized                      No
Synchronous-commit                        Synchronizing                     Yes
Asynchronous-commit                       Synchronizing                     Yes
Common Topologies
Now that we have defined the most common terms, let’s discuss the commonly used Always On Availability
Groups topologies and their advantages and limitations. The most common topologies that we see in the
field are the following:
• Stand-alone to Stand-alone
• SQL Failover Clustered Instance (FCI) to Stand-alone and vice versa
• SQL FCI to SQL FCI
• Extend on-premises Always On Availability Groups to Microsoft Azure
• Always On Availability Groups in Azure Virtual Machines (VMs)
• Distributed Availability Groups (introduced in SQL Server 2016)
Stand-alone to Stand-alone
This is the most commonly used Always On Availability Groups topology. In fact, more than 90% of the
Always On Availability Groups solutions that we see in the field use this topology. In this topology, we have
a two-or-more-node Windows Server Failover Cluster (WSFC) with a stand-alone instance of SQL Server
running on each cluster node. Figure 3-3 shows the simplest form of an Always On Availability Groups
solution implementation. The figure shows a two-node WSFC with two stand-alone instances of SQL Server,
each running on a separate node of the cluster.
■ Note Since the SQL Server instances are stand-alone, there is no shared storage required in this topology.
Each stand-alone SQL Server instance has its own copy of the data stored in local storage.
In the simplest form, the above topology will have both the cluster nodes in the same datacenter and
the replicas running under synchronous-commit mode with the failover mode set to automatic. With this
configuration, you will achieve a highly available solution with automatic failover (without data loss). You
also have disaster recovery as there are two sets of databases (one on each node). Additionally, you can
configure the secondary replica as readable and offload the reporting workload to the secondary replica
without impacting the primary workload running on the primary replica.
The disadvantage to this configuration is that since both the nodes are in the same datacenter, there
is no datacenter level disaster recovery available. As Availability Groups support one primary replica and
up to eight secondary replicas, you can easily add more replicas and configure them to achieve the HA and
DR business requirements for your applications. One popular configuration to achieve high availability
and datacenter level disaster recovery is to add a third node in a disaster recovery datacenter. Figure 3-4
shows a similar topology. The replicas (primary replica and secondary replica 1) in the primary datacenter
are running under synchronous-commit mode with automatic failover to provide high availability. The
third replica (secondary replica 2) is in a disaster recovery datacenter and is running under asynchronous-
commit mode to provide datacenter level disaster recovery. Also, since the third replica is running under
asynchronous-commit mode, the performance of the Availability Groups solution will not be affected by the
network latency between the two datacenters.
Figure 3-4. Stand-alone to Stand-alone AG Topology using three-node WSFC for HA and DR
■ Note All three nodes are participating in the same WSFC and do not use any shared storage. This is a
non-shared solution as the stand-alone SQL Server instances store the databases in local storage.
If your budget does not allow adding a third replica but you still need to provide datacenter level
disaster recovery, then you can place the second replica in the disaster recovery datacenter. In this
configuration, you will need to set the availability mode appropriately based on the network latency between
the two datacenters. For example, you may have to set the availability mode to asynchronous-commit mode
when the datacenters are far apart and network latency is a problem.
As you can imagine, there are a lot of possibilities in the stand-alone to stand-alone topology to deploy a
complete HA and DR solution based on your business requirements.
There are a few things that you need to know if you use an SQL FCI in an AG topology.
1. The SQL FCI and stand-alone SQL Server are all members of the same Windows
Server Failover Cluster.
2. Always On Availability Groups do not support automatic failovers to and from an
SQL FCI. SQL Server enforces this behavior by disabling the option for automatic
failover for any replica that is an SQL FCI. You can still have automatic failover
from one node to another node within the SQL FCI, but the availability replica on
the FCI can only manually fail over to or from another replica in the availability
group. Typically, this is not a problem as you do not want automatic failover from
the primary datacenter to the disaster recovery datacenter.
3. SQL FCI can host only one replica for a given availability group across all its
cluster nodes. When an availability replica is running the SQL FCI, the possible
owners list for the AG will contain only the active SQL FCI node.
4. SQL FCI replica uses a shared storage solution. The shared storage solution is
shared only by the nodes within the SQL FCI and not between the availability
group replicas.
As mentioned earlier, Always On Availability Groups do not support automatic failovers to and from an
SQL FCI. So in this topology, as in the previous one, the only automatic failover available is that of the SQL
FCI from one node to another within the FCI.
Although Always On Availability Groups support SQL FCIs as availability replicas, it is not very
commonly used by our customers. Most customers are able to meet all their high availability and disaster
recovery requirements using the stand-alone to stand-alone availability group topology. We have seen this
topology used by customers who already had an SQL FCI and wanted to extend the environment to another
datacenter using a secondary replica purely for disaster recovery scenarios.
Figure 3-7. Hybrid IT: Extend on-premises Always On AG to Microsoft Azure Topology
You can use the built-in ‘Add Azure Replica Wizard’ in SQL Server Management Studio to extend the
on-premises Always On Availability Group solution to include Azure replicas. Below are some benefits that
such a hybrid solution provides:
1. Microsoft Azure secondary replica will protect your application from issues
impacting your on-premises datacenter.
2. Microsoft Azure provides high availability mechanisms, such as service healing
for cloud services and failure recovery detection for the VMs that further
improves the availability of your replicas.
3. This solution will in most cases eliminate the need for an expensive disaster
recovery (DR) site, which eliminates the costs associated with DR site hardware,
maintenance, and operations.
4. Also, there is no charge for network traffic going into Microsoft Azure VMs. So,
the synchronization traffic from an on-premises primary replica to a Microsoft
Azure replica will be free of charge.
Figure 3-8. Azure-only: Always On Availability Groups in Azure Virtual Machines Topology
To achieve high availability, the primary replica and secondary replica 1 are running in Azure VMs
within the same region under synchronous-commit mode. To achieve disaster recovery, the secondary
replica 2 is running in a different datacenter in Azure VMs under asynchronous-commit mode. This solution
provides cross-region disaster recovery.
■ Note In the Azure Management Portal, there is a new gallery setup for Always On Availability Groups with a
Listener. This allows you to configure everything you need for an Always On AG solution automatically.
After reviewing the preceding topologies, you may have realized that they all need one single Windows
Server Failover Cluster (WSFC). But here one of the requirements is to use the two existing WSFCs. Another
requirement is to implement the solution without making a lot of big changes. That means you cannot work
with your infrastructure team to make both the datacenters in the same domain. Nor can you upgrade the
clusters to Windows Server 2016 and then create a single multi-domain cluster. So how can you accomplish
the goal of implementing an HA and DR solution using SQL Server 2016 Always On Availability Groups?
Before discussing the solution, let’s consider another scenario. Say you have two geographically
dispersed datacenters and there are different applications (App1 and App2) in each datacenter using
Always On Availability Groups for high availability. You have been asked to include disaster recovery for the
applications in each datacenter by using the other datacenter for disaster recovery.
One of the ways to implement a solution for both the scenarios is by combining Always On Availability
Groups with Log Shipping or Replication but you want to use Always On exclusively to make the solution
simple to deploy, administer, maintain, and troubleshoot.
SQL Server 2016 introduced a new feature called a Distributed Availability Group that will enable you
to accomplish both the above scenarios with minimal work. A distributed availability group lets you associate
two availability groups residing on different WSFCs. Figure 3-9 shows the architecture of a distributed
availability group.
In Figure 3-9, there are two separate clusters (WSFC1 and WSFC2). WSFC1 is in the primary datacenter
and WSFC2 is in the secondary datacenter. Each cluster has its own availability group (AG1 and AG2)
with a matching configuration of databases. AG1 and AG2 are associated to each other with a distributed
availability group. AG1 is the primary availability group in this figure and AG2 is the secondary availability
group. All your read-write activities occur on the primary replica AlwaysOnN1, and then get replicated to
its secondary replicas (AlwaysOnN2). The changes are also replicated to the primary replica (AlwaysOnN3)
on the secondary availability group AG2 on WSFC2. The primary replica (AlwaysOnN3) replicates those
changes to its secondary replicas (AlwaysOnN4).
■ Note In a distributed availability group, the secondary availability group (AG2 in Figure 3-9) automatically
becomes read-only and read/write activities can occur only on the primary replica of the primary availability
group (AG1 in Figure 3-9).
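A distributed availability group is created on the primary cluster by naming the two member availability groups and their listener endpoints. The following is a minimal sketch; the distributed availability group name, member group names, listener URLs, and port are illustrative:

CREATE AVAILABILITY GROUP [DistAG]
WITH (DISTRIBUTED)
AVAILABILITY GROUP ON
    N'AG1' WITH (
        LISTENER_URL = 'tcp://ag1-listener.contoso.com:5022',
        AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
        FAILOVER_MODE = MANUAL,
        SEEDING_MODE = AUTOMATIC
    ),
    N'AG2' WITH (
        LISTENER_URL = 'tcp://ag2-listener.contoso.com:5022',
        AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
        FAILOVER_MODE = MANUAL,
        SEEDING_MODE = AUTOMATIC
    );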
The topology shown in Figure 3-9 can be used to implement the solution for scenario 1 discussed in
this section. To implement a solution for scenario 2, you can create distributed availability groups for each
application (App1 and App2) as shown in Figure 3-10.
Figure 3-10 shows two distributed availability groups. Distributed availability group 1 associates AG1
and AG2, whereas distributed availability group 2 associates AG3 and AG4. All your read/write activities
for App1 occur on the primary replica AlwaysOnN1 of primary availability group AG1 and then get
replicated to its secondary replica AlwaysOnN2. The changes also get replicated to the primary replica
AlwaysOnN3 of the secondary availability group AG2, which then replicates the changes to its secondary
replica AlwaysOnN4. Similarly, all read/write activity for App2 occurs on the primary replica AlwaysOnN4
of primary availability group AG3. The changes then get replicated to its secondary replica AlwaysOnN3 and
then to the primary replica AlwaysOnN2 of the secondary availability group AG4.
■ Note Figures 3-9 and 3-10 show asynchronous-commit mode between the primary and secondary availability
groups as typically this topology is used for geographically dispersed sites. If you have a high-speed, reliable
network between your sites then you can configure synchronous-commit mode between the availability groups.
The following are some key differences between a regular availability group and a distributed
availability group:
• Since the distributed availability group has two WSFCs and each cluster maintains
its own quorum mode and node voting configuration, the health of the secondary
cluster does not affect the primary cluster. In a regular availability group topology,
we would need to exclude the cluster nodes at the DR site for voting to ensure that
the health of the cluster nodes in the DR site does not affect the health of the primary
site cluster nodes.
• In a distributed availability group, changes are replicated once over the network to
the secondary WSFC and then replicated within the cluster. In a regular availability
group with a single WSFC, the changes are replicated to each and every individual
replica. This makes distributed availability groups very efficient for geographically
dispersed sites.
• Automatic failover from primary availability group to secondary availability group is
not supported since automatic failover functionality is provided by WSFC, and there
is no single WSFC between the two sites in a distributed availability group.
• The operating system version on the primary and secondary WSFC can be different.
In a regular availability group, as there is only one WSFC, the operating system
version on all the cluster nodes has to be the same. Due to this advantage, you can
use distributed availability groups as one of the methods to perform rolling updates/
upgrades of the operating system.
• A distributed availability group is limited to two availability groups. However,
an availability group can be a member of more than one distributed availability
group. So, you can have distributed availability group 1 contain AG1 and AG2 and
distributed availability group 2 contain AG2 and AG3, with AG1 replicating to AG2
and AG2 replicating to AG3.
Summary
This chapter introduced the concepts and topologies that are important to understand in order to effectively
build, configure, and manage SQL Server 2016 Always On Availability Groups High Availability (HA) and
Disaster Recovery (DR) solutions.
In the next chapter, we will discuss how transaction log blocks are replicated from the primary replica to
a secondary replica in synchronous and asynchronous modes.
CHAPTER 4
Data Synchronization Internals
Now that we have looked at the Availability Groups concepts and common topologies, it's time to dive
deeper into the internals. We have already seen that Always On Availability Groups support two availability
modes (i.e., asynchronous-commit mode and synchronous-commit mode). In this chapter, we will try to
understand how the replication of transaction log blocks works from the primary replica to a
secondary replica in synchronous-commit and asynchronous-commit modes.
Synchronous-Commit Mode
Figure 4-1 gives an overview of how the replication of transaction log blocks from the primary replica to a
secondary replica works in synchronous-commit mode. For synchronous-commit to occur, both the current
primary replica and the secondary replica in question must be configured for synchronous-commit.
■ Note A log block is a contiguous chunk of memory (often 60 KB) maintained by the Log Manager.
3. When the log block becomes full or the primary replica issues a commit
operation, the log blocks from the log cache are flushed to the log file to make it
persistent. In an Always On Availability Group configuration, when the log block
is being flushed to the disk on the primary replica, it also gets copied to the log
pool.
4. The log blocks in the log pool are read by a thread called log capture, and its
job is to read the log blocks from the log pool and send them to the secondary
replica. In case of multiple secondary replicas, there is one log capture thread for
each replica that ensures that the log blocks are sent across multiple replicas in
parallel. The log content gets compressed and encrypted before being sent over
to the secondary replicas.
5. On the secondary replica, a thread called log receive receives the log blocks from
the network.
6. It writes to the log cache on the secondary replica.
7. There’s one redo thread per database that is always running on the secondary
replica. While the log blocks are being written to the log cache, it reads those log
blocks and applies those changes to the data pages and the index pages in the
database on the secondary to bring it up to date with whatever has happened on
the primary replica. When the log block on the secondary replica becomes full,
or it receives a commit log record, it hardens the content of log cache onto the log
disk on the secondary replica.
8. If the secondary replica is configured to run in synchronous mode, it will send
an acknowledgment on the commit to the primary node indicating that it has
hardened the transaction, and so it is safe to tell the user that the transaction is
committed. And because the log has been hardened on the secondary, there is a
guarantee that in case there is a failover, there will be no data loss.
9. The redo thread runs independently of how log blocks are being received, copied, and persisted on the secondary. If the redo thread is running a few minutes behind, those log blocks may no longer be available in the log cache. In that case, the redo thread picks them up from the log disk, which is what the dotted line on the right side of the figure shows.
■ Note Availability group replicas ping each other to signal that they are still active. If a replica receives a ping back from the other replica within the session-timeout period (default 10 seconds), it indicates that the replicas are up and running and are communicating with each other. If the primary replica does not get a ping response from the secondary replica within the session-timeout period, the primary replica temporarily shifts into asynchronous-commit mode for that secondary replica. When the secondary replica reconnects with the primary replica, they resume synchronous-commit mode.
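The session-timeout period can be tuned per replica. A minimal, hedged sketch, assuming the availability group is named AGCorp and the replica is AlwaysOnN2 as in this book's setup (run on the primary replica; the value is in seconds):

ALTER AVAILABILITY GROUP [AGCorp]
    MODIFY REPLICA ON N'AlwaysOnN2'
    WITH (SESSION_TIMEOUT = 15);  -- raise from the 10-second default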
Asynchronous-Commit Mode
Figure 4-2 gives an overview of how the replication of transaction log blocks from the primary replica to a
secondary replica works in asynchronous-commit mode. The process is similar to synchronous-commit mode except that the commit is acknowledged as soon as the log blocks are persisted in the primary replica's transaction log; the primary does not wait for the secondary replica to harden them.
7. There is a redo thread that is always running on the secondary replica. While the log blocks are being written to the log cache, it reads them and applies the changes to the data pages and index pages in the database on the secondary to bring it up to date with whatever has happened on the primary replica. When the log cache on the secondary replica becomes full, or a commit log record is received, the content of the log cache is hardened onto the log disk on the secondary replica.
8. The redo thread runs independently of how log blocks are being received, copied, and persisted on the secondary. If the redo thread is running a few minutes behind, those log blocks may no longer be available in the log cache. In that case, the redo thread picks them up from the log disk, which is what the dotted line on the right side of the figure shows.
■ Note Starting with SQL Server 2016, the synchronization throughput of Availability Groups has increased ~10x due to improvements in the data synchronization process. The performance improvements include parallel and faster compression of log blocks on the primary replica, an optimized synchronization protocol, and parallel decompression and redo of log records on the secondary replica. In SQL Server 2012 and 2014, the redo process is executed serially by a single thread and is therefore bound to a single CPU core. For high-write workloads, the secondary replicas may not be able to keep up, resulting in long downtimes during a failover. In SQL Server 2016, redo is executed in parallel to make use of all the available CPU cores. As a result, database recovery times on failover have improved, and the data on the secondary is fresher.
Any transaction log backups taken in the interim must be restored to the secondary replica database before it is introduced back in the availability group. For a very large database or a busy database with frequent transaction log backups, this could be difficult to achieve during business hours. In such cases, it may be beneficial to add the databases back during off hours.
* If you issue a forced failover command on a synchronized secondary replica, the secondary replica behaves
the same as for a manual failover.
■ Note The recovery queue is the amount of log in the log files of the secondary replica that has not yet been redone. The amount of log in the recovery queue, the speed of the system, and the workload determine the time required to complete the roll-forward process on the secondary database.
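One way to observe the recovery queue is to query the sys.dm_hadr_database_replica_states DMV on the secondary replica; a minimal sketch (queue and rate columns are reported in KB and KB/sec):

SELECT DB_NAME(database_id) AS database_name,
       synchronization_state_desc,
       redo_queue_size,   -- log not yet redone (the recovery queue), in KB
       redo_rate          -- rate at which log is being redone, in KB/sec
FROM sys.dm_hadr_database_replica_states
WHERE is_local = 1;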
3. At this point, the secondary replica becomes the new primary replica. Its
databases become the primary databases. The new primary replica rolls back any
uncommitted transactions (the undo phase of recovery) as quickly as possible.
Rollback occurs in the background while clients use the database. Committed
transactions are not rolled back as part of this process.
4. Until the secondary databases connect and resynchronize to the new primary databases, they are marked as NOT SYNCHRONIZED. Before the rollback recovery starts, secondary databases can connect to the new primary databases and quickly transition to the SYNCHRONIZED state.
5. When the original primary replica comes back online, it becomes a secondary replica and its databases become secondary databases. It then resynchronizes with the corresponding primary replica and databases. After the databases have resynchronized, failover is possible again, this time in the reverse direction.
■ Note We recommend forcing failover only if you must restore service to your availability databases
immediately and are willing to risk losing data.
1. In the above example, before the primary replica goes offline, the last hardened
LSN on the primary replica is 100 while that of the asynchronous secondary
replica is 50.
2. After the primary replica goes offline and a forced failover is initiated, the
secondary replica becomes the new primary replica and marks its last hardened
LSN as 50.
3. Once the old primary replica is brought online, it shows its synchronization as
suspended.
4. If synchronization on the old primary is resumed, it synchronizes with the new primary and sends its last hardened LSN as 100. When it sees that the last hardened LSN of the new primary is 50, it rolls back its transaction log to LSN 50 and, from that LSN onwards, starts accepting transaction log blocks from the primary replica. Thus, the data loss on the new primary is propagated to the old primary when synchronization is resumed.
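As a hedged sketch of the commands involved in this scenario (availability group and database names follow this book's setup):

-- On the secondary replica being promoted:
ALTER AVAILABILITY GROUP [AGCorp] FORCE_FAILOVER_ALLOW_DATA_LOSS;

-- After the old primary comes back online (as a suspended secondary),
-- resume data movement for each database, accepting the data loss:
ALTER DATABASE [AdventureWorks2016] SET HADR RESUME;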
Summary
This chapter should have helped you understand the internal workings of data synchronization within Availability Groups. In the next chapter, we will take a look at one of the key requirements for Availability Groups: Windows Server Failover Clustering.
CHAPTER 5
Introduction to Windows Server Failover Clustering
In the previous chapters, we mentioned that Always On solutions leverage the Windows Server operating
system and Windows Server Failover Cluster (WSFC) as a platform technology. For a given Always On
Availability Group, the availability replicas must be hosted by separate instances of SQL Server running
on different nodes of the same WSFC. Availability groups rely on WSFC for health monitoring, failover
coordination, and server connectivity. Hence it’s crucial that SQL Server database architects and
administrators have a solid understanding of WSFC.
To understand the various components that make up the Windows Server failover cluster, we will define
the most common failover clustering terms.
Cluster Node
The independent servers that work together in a failover cluster are referred to as cluster nodes. Each cluster node has a working installation of the Microsoft Windows Server operating system.
■ Note In Windows Server 2012 R2 and earlier versions, all the cluster nodes were required to be members of the same domain. Starting with Windows Server 2016, you can create failover clusters without Active Directory dependencies.
Networks
To ensure smooth cluster operations, it is important to have redundant and reliable communications among
all cluster nodes. Properly configured connectivity between the cluster nodes is required to provide access to
the highly available services required by the clients and to guarantee the internal cluster communications.
To avoid a single point of failure, two or more independent networks must be connected to all the
cluster nodes. A network component failure (NIC, router, switch, hub, etc.) should not cause the cluster
communication to break down and communication should continue to make progress in a timely manner.
If cluster communication cannot proceed on one network, the switchover to the other cluster-enabled
network is automatic. This is why the cluster nodes must have multiple network adapters available to
support cluster communications and each one should be connected to a different switch.
■ Note Starting with Windows Server 2008 clustering, there is no concept of a public and private network.
In previous versions of Windows failover cluster, the public network provided clients with access to
the highly available cluster services and the private network was used for internal cluster communications.
Starting from Windows Server 2008 clustering (and above), cluster communications go over all networks
that can communicate between the nodes. It’s part of the cluster health process to see if the network is up
and can talk to the other node.
Cluster Resource
A cluster resource is a physical or logical entity that can do the following:
• be managed by a failover cluster,
• be brought online or taken offline,
• be hosted (owned) by only one cluster node at any point of time.
Examples of cluster resources are a network name, IP address, disk, and Availability Group listener.
Resource Dependency
Resource dependency is a two-way association between resources. If a resource A depends on resource B,
then resource B is a dependency of resource A. As an example, a network name resource depends on an IP
address resource. This means that for a network name resource to be online, an IP address resource needs to
be online first. Also, if you take the network name resource offline, the IP address resource will be first taken
offline followed by the network name resource. A resource can be dependent on multiple resources using
AND or OR logic.
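For illustration, a listener's network name typically has an OR dependency on its IP address resources in a multi-subnet cluster, so the name can come online when either address is online. A sketch using the FailoverClusters PowerShell module (the resource names shown are illustrative, not from a specific figure):

# View the current dependencies of a resource
Get-ClusterResourceDependency -Resource "AGCorp_Listen"

# Set an OR dependency on two IP address resources
Set-ClusterResourceDependency -Resource "AGCorp_Listen" `
    -Dependency "[IP Address 10.1.1.20] or [IP Address 10.2.2.20]"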
Role
A role, previously referred to as resource group, is a collection of resources that are required to run a specific
application or service. An Availability Group is a role.
Failover
Failover is the process of moving a group from one cluster node to another in the case of a failure. Windows
server failover cluster attempts to perform a failover of a group when the following happens:
• A cluster node failure or a resource failure occurs and the restart threshold is hit
Or
• An administrator performs a manual failover
Failback
Failback is the process of moving a group back to its preferred owner. When a node becomes inactive for some reason, the failover cluster fails over the groups hosted by the node. When the node becomes active again, the failover cluster can fail back the groups originally hosted by the node. By default, failback is disabled on the groups. The default setting is useful because you do not want the groups to ping-pong: failing over from one node to another, failing back to the original node as soon as it is active, and failing over yet again if the original node becomes inactive. It is better to first evaluate the cause of the original failure, remediate the issue, and only after ensuring that the node is healthy, manually fail back the groups to the original node.
■ Note Even though the default failback settings work for most scenarios, you can configure failback to occur during a specific time period, say between 8 p.m. and midnight. If you do configure failback, it is important to set the failback time such that failback does not happen during peak business hours.
Preferred Owner
Preferred owner (also referred to as preferred node) is the cluster node where you want the resource group
to run. Most customers do not set the preferred owner for a group because it does not matter where the
group resides. All that matters is that the group is running on one of the cluster nodes. When a node fails, the
remaining nodes take over the groups from the failed node. This increases the load on the other nodes, and unless they have the capacity to take on the additional load, performance will be affected. Hence, it is important
to design the failover cluster such that the nodes can host the groups in case of failover with minimum
performance impact. Failback does not occur if a preferred owner is not configured for the group.
■ Note One strong motivation to use preferred owners with availability groups is licensing. Availability groups (AGs) do not require a SQL Server license for one disaster recovery (DR) replica as long as it is truly
passive and the primary SQL Server instance is covered by active Software Assurance (SA). In this case,
Microsoft usually places a limit on how many days you can run the AG on the DR replica in a year. To ensure
that the limit is not exceeded, you can utilize preferred owners and scheduled failbacks.
Possible Owner
Possible owner (also referred to as possible node) is a cluster node that can run a resource. By default, all cluster nodes are possible owners so that the resources can run on any node. In most cases, we use the default settings. But if you have a multi-node cluster and you do not want a particular node to run some resources, you can remove the node from the possible owners list of the resource.
Heartbeat
The failover cluster keeps track of the current state of the cluster nodes within a cluster and determines
when a group and its resources fail over to another cluster node. The communication takes place in the form
of messages that are sent regularly between each cluster node. These messages are called cluster heartbeats.
The cluster heartbeats let all nodes know which nodes are up and which are down. The cluster heartbeat mechanism uses port 3343, is unicast in nature, and uses a Request-Reply type process. This provides for higher
security and more reliable packet accountability.
Table 5-1 shows the heartbeat configuration settings along with default and maximum values.
■ Tip To be more tolerant of transient failures on Windows Server 2012 R2 and below, it is recommended
to increase the SameSubnetThreshold and CrossSubnetThreshold values to the higher Windows Server 2016
values.
The delay parameter is the frequency at which the cluster heartbeats are sent between cluster nodes. It
is the number of seconds before the next heartbeat is sent. By default, for single subnet (SameSubnetDelay)
and multi-subnet clusters (CrossSubnetDelay), a heartbeat is sent every one second.
The threshold parameter is the number of heartbeats that a cluster node can miss before the failover
cluster takes recovery action. By default, for single subnet (SameSubnetThreshold) and multi-subnet
(CrossSubnetThreshold) clusters, if a node misses a series of five heartbeats, the cluster considers the node
to be unreachable and takes recovery action.
CrossSiteDelay and CrossSiteThreshold are new parameters for Windows Server 2016 failover clusters
that have nodes in different fault domain sites.
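These settings can be viewed and changed with PowerShell. A sketch that applies the Tip above on a Windows Server 2012 R2 (or older) cluster; the values shown are the Windows Server 2016 defaults:

# View the current heartbeat settings
Get-Cluster | Format-List *SubnetDelay, *SubnetThreshold

# Raise the thresholds to the Windows Server 2016 values
(Get-Cluster).SameSubnetThreshold = 10
(Get-Cluster).CrossSubnetThreshold = 20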
Quorum
Mission-critical workloads hosted on failover clusters are considered highly available if the cluster nodes
that are hosting the resources are up. However, the cluster generally requires more than half the nodes to
be up and running, which is known as having a quorum. The quorum of a failover cluster determines the
number of failures the cluster can sustain while still remaining online. If an additional failure occurs beyond
this threshold, the cluster will stop running.
Quorum is extremely important for any high availability solution and is designed to handle the scenario
when there is a communication problem between sets of cluster nodes, so that two nodes do not try to
simultaneously host a resource group and write to the same disk at the same time. This is known as split-brain
syndrome, and you need to prevent this to avoid any potential corruption to a disk by having two simultaneous
disk owners. The failover cluster forces the cluster service to stop in one of the subsets of nodes to ensure that
there is only one true owner of a particular resource group. Once the stopped nodes can communicate with the
main group of nodes, they will automatically rejoin the cluster and start the cluster service.
Quorum is based on a voting algorithm where more than half of the voters must be online and be able
to communicate to each other. The concept is similar to a parliament house or committee. If there are too
few members present, then the committee does not have a quorum (required number of votes) and cannot
hold an election. More than half of the total votes is required to achieve quorum to avoid having a tie in the
number of votes in a partition, since majority will mean that the other partition has less than half the votes.
In a five-node failover cluster, three voters must be online; yet in a four-node failover cluster, three voters
must also remain online to have majority. Because of this logic, it is recommended to have an odd number of
voters in the cluster. A voter in a cluster can be the following:
• A cluster node – every cluster node has one vote.
• Disk witness, file share witness, or cloud witness – either a disk witness, file share
witness, or a cloud witness can have one vote in the cluster but not multiple disks,
multiple file shares, multiple cloud witnesses, nor any combination of the three.
Cloud witness is only available in Windows Server 2016.
■ Note If a healthy failover cluster loses quorum, the failover cluster is designed to stop the cluster service, causing mission-critical applications such as Always On Availability Groups to shut down. Therefore, it is critical to ensure that quorum is always maintained; otherwise, the cluster service will be stopped.
Dynamic Quorum
To understand this concept, let's assume you have a five-node failover cluster. If you lose three nodes, the cluster would go offline even though two nodes remain, because more than half of the votes are lost. To avoid this scenario, starting from Windows Server 2012 failover clustering, a new functionality called dynamic quorum was introduced. As the name implies, dynamic quorum adjusts the quorum dynamically. This allows the failover cluster to continue to function when less than 50% of the nodes are active.
Dynamic quorum works only if the nodes are shut down sequentially. So, in the five-node failover cluster scenario above, assuming you didn't lose all three nodes at the same time, as cluster nodes went
offline, the number of votes in the quorum would adjust dynamically. When the first node failed, you would
in theory have a four-node cluster. When the second node failed, you would then have a three-node cluster,
and so on. If you continued to lose cluster nodes one by one, you could go all the way down to a one-node
cluster also referred to as Last Man Standing and still remain online.
■ Note Dynamic Quorum is enabled by default when a Windows Server 2012 (or above) failover cluster is
created.
Dynamic Witness
Even with dynamic quorum functionality, the administrator had to decide whether or not to add a quorum witness (disk or file share), depending on the number of nodes, to keep the total number of votes odd. The administrator had to monitor this and manually adjust it as nodes were added to or evicted from the cluster. To relieve the administrator of this burden, starting with Windows Server 2012 R2, dynamic witness is enabled by default when the failover cluster is created. As the name suggests, dynamic witness automatically decides when to use the quorum witness's vote.
■ Note The new recommendation for Windows Server 2012 R2 (and above) failover clusters is to include
the quorum witness and the cluster will decide when to use its vote or not. Windows Server 2012 R2 has two
types of quorum witnesses (disk and file share) and Windows Server 2016 has an additional witness type (cloud
witness).
If there is an even number of nodes in the cluster, the cluster counts the witness as a vote (1), making the total number of votes odd. If there is an odd number of nodes, the cluster does not count the witness vote (0), as the total is already odd. If the witness goes offline, the cluster sets the witness vote to 0 to prevent unnecessary issues due to quorum failures.
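You can observe this behavior with PowerShell; a brief sketch:

# 1 = dynamic quorum behavior is enabled on the cluster
(Get-Cluster).DynamicQuorum
# 1 = the witness currently carries a vote, 0 = it does not
(Get-Cluster).WitnessDynamicWeight

# Per-node vote assignments as currently adjusted by the cluster
Get-ClusterNode | Format-Table Name, State, NodeWeight, DynamicWeight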
As discussed earlier in this chapter, if one cluster node fails, a process called failover automatically shifts
the workload of the failed node to another working node in the cluster. This ability of a failover cluster to
handle failures allows the cluster to meet two important requirements:
• High availability – the ability to provide the end users access to a service for a high
percentage of time while reducing unscheduled outages.
• High reliability – the ability to reduce the frequency of system failures.
Because the mission-critical services are not bound to any specific cluster node, interruption to any
such node only has a minor impact on the overall availability of the service. On the other hand, a single
server solution would be completely offline during a patch window or disaster that affected the host. On a
multi-node cluster, the hosted mission-critical services can be easily moved from the node that needs to
be taken offline for maintenance or disaster recovery, which greatly improves the overall uptime of the
mission-critical application.
Another benefit of failover clustering is scalability. Starting with Windows Server 2012, failover clustering supports up to 64 cluster nodes. This helps to easily scale out the solution by adding nodes to it. For example, if you have a two-node failover cluster hosting applications that are getting close to hitting a node-specific capacity limit such as CPU or memory, it is very easy to add a third node and redistribute the load.
The total cost of ownership (TCO) has been reduced for failover clusters as the failover clustering
feature is now included (starting from Windows Server 2012 and above) in Standard Edition. With this
change, small- and medium-size businesses can take advantage of high availability in their standard server
deployments.
Although failover clustering has a lot of benefits, it is important to understand its limitations, too. Failover clustering does not protect against the following:
• Shared storage failures,
• Network failures,
• Hack attacks,
• Operational errors.
Failover clustering does not provide load balancing and fault tolerance. Fault tolerant servers are those
that have complete redundancy across all components. If one component fails, the secondary component
takes over in a process that is seamless to the application running on the server. This is different from a
failover cluster where a hardware or software failure causes the workload to move from the failed node to
a healthy node. During the failover process, the end users are disconnected briefly and minimal downtime is incurred.
Summary
In this chapter, we provided an overview of Windows Server failover clustering and covered the basics. We defined the most common failover cluster terms and discussed the benefits and limitations of failover clusters. In the next chapter, we will discuss the requirements and considerations for deploying Always On Availability Groups.
CHAPTER 6
Prerequisites
Chapter 5 looked at Windows Server Failover Cluster (WSFC), one of the key requirements for Always On
Availability Groups. In this chapter, we will look at the prerequisites for deploying availability groups.
Windows Requirements
To support the Always On Availability Groups feature, ensure that every computer that is participating in one
or more Availability Groups meets the following fundamental requirements:
• OS is Windows Server 2008 or later
• Each computer should be running either x86 (non-WOW64) or x64 Windows
Server 2008 or later versions.
• In Windows Server 2012 R2 and previous versions, a cluster can only be created
between member nodes joined to the same domain. However, starting with
Windows Server 2016, you can now create a Failover Cluster without Active
Directory dependencies, as discussed in Chapter 7.
Failover clusters can be created in the following configurations:
• Single-domain Clusters - Clusters with all nodes joined to the same
domain.
• Multi-domain Clusters - Clusters with nodes that are members of different
domains.
• Workgroup Clusters - Clusters with nodes that are member servers /
workgroup (not domain joined).
• Hence SQL Server 2016 on Windows Server 2016 is supported in the following
environments:
• Cross domains (with trust)
• Cross domains (no trust)
• No domain at all
■ Note Windows Server 2016 clusters use certificates for intra-cluster authentication. Management of SQL
Server does not change; it uses certificate-secured endpoints internally like database mirroring.
■ Note Windows Server 2012 includes all the fixes recommended for availability groups on Windows Server
2008 R2.
■ Note Distributed Availability Groups can be set up between availability groups in different WSFCs as seen
in Chapter 3. However, the nodes participating in the individual availability groups should be in the same WSFC.
Recommendations
Now let’s take a look at some recommendations for the computers that host availability replicas:
• OS - Windows Server 2012 or above
• Windows Server 2012 and above works best with availability groups and
includes all the fixes for Availability Groups.
• Can handle comparable workload
• For a given availability group, all the availability replicas should run on
comparable computers that can handle identical workloads.
• Sufficient disk space
• Every computer on which a SQL Server instance hosts an availability replica
must possess sufficient disk space for all the databases in the availability group.
• Dedicated network adapters
• Use a dedicated network adapter (NIC) for availability group data
synchronization to achieve the best performance. The same network links should be used for communications between WSFC cluster members and communications between availability replicas. If separate network links are used, they can cause unexpected behaviors if some of the links fail.
■ Note In the case of a distributed availability group, the SQL Server instances can reside on nodes in different WSFCs, as a distributed availability group can be set up between availability groups in different WSFCs. However, the SQL Server instances participating in an individual availability group should still be on the nodes of a single WSFC.
• Enterprise Edition
• Use SQL Server Enterprise Edition to configure availability groups on SQL
Server Instances.
■ Note Basic Availability Groups do not need SQL Server Enterprise Edition and are supported on SQL Server Standard Edition.
Summary
This chapter should have helped you understand the requirements and considerations for deploying
Availability Groups. You might want to use the above as a checklist for your availability groups deployment.
Now that you have a good understanding of the requirements and considerations for deploying availability
groups, in the next chapter we will take a look at how to create a WSFC.
PART III
CHAPTER 7
Create a Windows Server Failover Cluster
After reviewing the requirements and considerations for creating Always On Availability Groups in chapter 6,
you understand that one of the key requirements for creating Always On Availability Groups is creating a
Windows Server Failover Cluster (WSFC). In this chapter we will provide step-by-step instructions to create a Windows Server 2016 failover cluster using the Failover Cluster Manager interface and Windows PowerShell.
The setup environment used in this book simulates an almost real-life production environment. It
consists of five servers:
• One Domain Controller (DC) named AlwaysOnDC.
• Two member servers named AlwaysOnN1 and AlwaysOnN2 in the primary
datacenter for high availability.
• Third member server named AlwaysOnN3 in the secondary datacenter for disaster
recovery.
• One client machine named AlwaysOnClient in the primary datacenter.
■ Note Most real-life production environments typically will have at least two domain controllers to provide
high availability and fault tolerance.
All the servers (AlwaysOnN1, AlwaysOnN2, and AlwaysOnN3) are members of the same domain
CORPNET.CONTOSO.COM and have Windows Server 2016 Standard Edition and stand-alone instances of
SQL Server 2016 Enterprise Edition installed. To provide a high availability and disaster recovery solution
using Always On Availability Groups, we will create a three-node Windows Server 2016 Failover Cluster.
Figure 7-1 shows the setup environment that we have used in the book.
■ Note Before proceeding to create the failover cluster, ensure that the setup account that will be used to create the failover cluster has membership in the local Administrators group on all the servers that will become cluster nodes, and has the Active Directory Create Computer Objects and Read All Properties privileges on the organizational unit (OU), in order to create the computer objects.
■ Note The minimum permission required to install the Failover Clustering feature on a server is membership in the local Administrators group.
Figure 7-2. Accessing Add Roles and Features Wizard from Server Manager
This will invoke the Add Roles and Features Wizard as shown in Figure 7-3. Review the Before you begin
screen and then select Next.
Figure 7-5. Selecting the server to install the Failover Clustering feature
Figure 7-7. Adding additional tools to manage the Failover Clustering feature
Figure 7-8. Confirming the installation selections for the Failover Clustering feature
■ Note The installation of the Failover Clustering feature does not require a reboot, making it unnecessary to
check the Restart the destination server automatically if required check box.
When the Failover Clustering feature is installed, you will see the Results screen shown in Figure 7-9.
Click Close.
Figure 7-9. Results screen for installing the Failover Clustering feature
After you install the Failover Clustering feature, the Failover Cluster Manager interface is available in
the Tools menu of the Server Manager interface. Alternatively, you can open Failover Cluster Manager by opening Control Panel, then selecting Administrative Tools, and then Failover Cluster Manager.
■ Note Repeat the above steps to add the Failover Clustering feature to all the servers that you want to
include in the Windows Server 2016 Failover Cluster. Without installing the Failover Clustering feature on a
server, you cannot include it in a failover cluster.
Figure 7-10. Invoking the Cluster Validation Wizard using Failover Cluster Manager
This will invoke the Validate a Configuration Wizard (also referred to as the Cluster Validation wizard)
as shown in Figure 7-11. Review the Before You Begin screen and click Next to start the validation process.
Figure 7-11. Reviewing the Before you begin screen of the Cluster Validation Wizard
Figure 7-12. Entering the server names that you want to include in the cluster
Figure 7-13. Selecting Run all tests in the Testing Options screen
The cluster validation wizard is a very powerful tool: it is not only used for initial cluster validation, but also often used to troubleshoot an existing failover cluster. While troubleshooting an existing cluster, you may not want to run all the tests, but only the tests related to the issue. For such scenarios, select the second testing option, Run only tests I select. A common scenario is to uncheck some or all of the storage tests when you are not troubleshooting storage. If storage tests are selected, the disks assigned to cluster roles will be taken offline, making them unavailable while the tests run.
Confirm Selections
In the Confirmation screen as shown in Figure 7-14, review the settings and confirm that you want to run the
selected validation tests. Click Next to start the validation process.
Review Report
In the Summary screen, ensure that the tests completed successfully and the configuration is suitable for
clustering. A sample Summary screen is shown in Figure 7-15.
Click View Report to review the validation report. A sample validation report is shown in Figure 7-16. Click Finish to close the wizard. The validation report is stored in the C:\Windows\Cluster\Reports folder for future review.
■ Note If any tests fail, Microsoft does not consider the configuration to be supported. The problem must be
corrected and the cluster validation wizard rerun to confirm that the configuration passes all the tests.
Figure 7-17. Invoking Create Cluster action in the Failover Cluster Manager
This invokes the Create Cluster Wizard as shown in Figure 7-18. Review the Before You Begin screen
and click Next to start creating the failover cluster.
Figure 7-18. Reviewing the Before You Begin screen of the Create Cluster Wizard
Select Servers
In the Select Servers screen, enter the names of all the servers that will be part of the failover cluster
separated by a semicolon and click Add. Figure 7-19 shows the screen for our environment with the three
server names AlwaysOnN1, AlwaysOnN2, and AlwaysOnN3. Click Next to continue.
Figure 7-19. Adding the servers that you want to include in the cluster
■ Note The cluster name object (CNO) needs to have the Create Computer Objects privilege on the organizational unit (OU) it currently resides in to be able to create Virtual Computer Objects (VCOs). If the cluster name account
is deleted or permissions are removed, other accounts cannot be created until the cluster name account is
restored or the correct permissions are reinstated.
Also, in the Access Point for Administering the Cluster screen, select the networks you want to be used
and type the cluster IP address as shown in Figure 7-20. In this figure, we have two cluster IP addresses as we
have two subnets (one in primary datacenter and the other in the disaster datacenter) in our environment.
Click Next to continue.
Figure 7-20. Entering the cluster name and network address to access the failover cluster
■ Note If you are using a specific organizational unit (OU) like we have done in our setup environment, ensure that the setup account is given the Create Computer Objects and Read All Properties privileges on the OU.
Confirm Selections
Confirm the selections in the Confirmation screen as shown in Figure 7-21. Click Next to continue.
■ Note Uncheck Add all eligible storage to the cluster. By default, this box is checked. Failure to uncheck this
box will add all the storage (including the local drives) to the cluster. If you choose not to add all eligible storage
to the cluster, you can always add specific disks after the cluster is created.
Review Report
Review the Summary screen to ensure that the cluster is successfully created as shown in Figure 7-22.
Review any warnings that may be displayed. For example, the following warning is displayed in the
Summary screen for our setup:
An appropriate disk was not found for configuring a disk witness. The cluster is not
configured with a witness. As a best practice, configure a witness to help achieve the highest
availability of the cluster. If this cluster does not have shared storage, configure a File Share
Witness or a Cloud Witness.
This warning is expected in our environment as we do not have any shared disk in the environment. Click View Report to review the report created by the wizard. This report is stored in the C:\Windows\Cluster\Reports folder for future review. Click Finish to close the wizard. The Failover Cluster Manager interface will automatically connect to the cluster when the wizard closes, as shown in Figure 7-23.
Figure 7-23. Failover Cluster Manager connected to the newly created cluster
This will open the Configure Cluster Quorum Wizard as shown in Figure 7-25. Review the Before You
Begin screen and click Next to continue.
Figure 7-25. Reviewing the Before You Begin Screen of Configure Cluster Quorum Wizard
Starting with Windows Server 2016 Technical Preview, failover cluster introduces the cloud witness
quorum type. The cloud witness quorum takes advantage of the Microsoft Azure public cloud as the witness
for the failover cluster. This can be very useful for multisite clusters that do not have a third site to place a
file share witness. The cloud witness acts the same as a file share witness, gets a vote, and can participate in
quorum calculations.
■ Note To configure a cloud witness quorum, you need a valid Microsoft Azure Storage Account Name and
Access Key corresponding to the Storage Account. This information is entered in the Configure cloud witness
screen that is invoked by selecting the option Configure a cloud witness in the Select Quorum Witness screen
shown in Figure 7-27.
Confirm Selections
Confirm the quorum settings on the Confirmation screen as shown in Figure 7-29 and then click Next to start
the quorum configuration.
Review Report
Review the Summary screen to ensure that the quorum is configured successfully as shown in Figure 7-30. Click View Report to review the report created by the wizard. This report is stored in the C:\Windows\Cluster\Reports folder for future review. Click Finish to close the wizard.
■ Note To run the PowerShell cmdlets, open the PowerShell console with elevated privileges by right-clicking
the PowerShell app and selecting Run as administrator.
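To install the Failover Clustering feature on the local server, run a command along the following lines:

Install-WindowsFeature -Name Failover-Clustering -IncludeManagementTools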
■ Note The -IncludeManagementTools switch in the above command installs the Failover Cluster
Management tools and Failover Cluster Module for Windows PowerShell.
If you want to install the Failover Clustering feature on other servers without having to log into them,
use the -ComputerName parameter as shown below:
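For example (the target server name is from this book's setup):

Install-WindowsFeature -Name Failover-Clustering -IncludeManagementTools -ComputerName AlwaysOnN2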
To view the installed Failover Clustering feature, run the below PowerShell command:
Get-WindowsFeature *clus*
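To create the cluster itself with PowerShell, a command along these lines can be used; the node names match this book's setup, while the static addresses for the two subnets are illustrative:

New-Cluster -Name AlwaysOnCluster -Node AlwaysOnN1,AlwaysOnN2,AlwaysOnN3 `
    -StaticAddress 10.1.1.10,10.2.2.10 -NoStorage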
■ Note The -NoStorage parameter is specified to not add all eligible storage to the cluster.
■ Note The ability to create failover clusters without domains and support availability groups opens up a
number of new scenarios for our customers, and removes previous blocks that prevented migration from the
deprecated Database Mirroring technology to Always On Availability Groups.
In this section, we will demonstrate the steps to create a workgroup and/or multi-domain cluster. The
below prerequisites for single-domain clusters still apply for workgroup and multi-domain clusters:
• All the hardware and software components must be “Certified for Windows
Server 2016.”
• The fully configured solution (servers, network and storage) must pass all cluster
validation tests.
■ Note Similar to creating traditional single-domain failover clusters, Failover Cluster Manager interface and
Windows PowerShell can be used to create workgroup or multi-domain clusters.
In addition to the prerequisites of single-domain clusters, the below steps are needed to create a
workgroup and multi-domain cluster.
■ Caution Failing to set the LocalAccountTokenFilterPolicy registry value will result in a "Requested registry access is not allowed" error while trying to create a cluster using non-built-in administrator accounts.
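One way to set this registry value on each node is with PowerShell; a sketch:

New-ItemProperty -Path "HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System" `
    -Name LocalAccountTokenFilterPolicy -Value 1 -PropertyType DWord -Force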
Summary
In this chapter, we created a Windows Server 2016 failover cluster, one of the key requirements for availability groups, using the graphical user interface (Server Manager and Failover Cluster Manager) and Windows PowerShell. We also discussed the workgroup and multi-domain cluster deployments that are introduced in Windows Server 2016. In the next chapter, we will create the Always On Availability Groups.
CHAPTER 8
Create Availability Groups
In chapter 7 you created a Windows Server Failover Cluster (WSFC), one of the key requirements for
Always On Availability Groups. In this chapter, we will look at how to create and configure availability
groups.
The setup environment used for this chapter was introduced in the previous chapter. Just to recap, the
environment consists of five servers:
• 1 DC (AlwaysOnDC)
• 2 Nodes in the Primary datacenter (AlwaysOnN1 and AlwaysOnN2)
• 1 Node in the secondary datacenter (AlwaysOnN3)
• 1 Client machine (AlwaysOnClient)
A Windows Server 2016 Cluster (AlwaysOnCluster) was created on three nodes: AlwaysOnN1,
AlwaysOnN2, and AlwaysOnN3 in chapter 7.
This chapter assumes a stand-alone SQL Server 2016 Enterprise Edition installation has already
been performed on all machines that are going to be the replicas in the Always On Availability Group.
The SQL instances, AlwaysOnN1, AlwaysOnN2, and AlwaysOnN3 will be the replicas for the availability
groups.
Review chapter 6 for a detailed list of prerequisites before attempting to create an availability group.
Figure 8-1. SQL Server Service properties option in the SQL Server Configuration Manager
Figure 8-2 shows the SQL Server service properties box and Always On Availability Group tab with the
option to enable the Always On Availability Groups feature.
After the Properties window opens, select the Always On High Availability tab, check the Enable Always On Availability Groups check box, and click Apply. As soon as you click Apply, you will see a dialog box warning that the changes will not take effect until the service is restarted. Click OK on the dialog box.
You might have noticed in the above figure that Windows Failover Cluster Name (AlwaysOnCluster)
shows up under the Always On High Availability tab. This is the Windows Cluster you created in chapter 7.
While you are in the SQL Server Service Properties window, also review the Log On Service account
being used for the SQL instance by clicking on the Log On tab. Figure 8-3 shows the Log On tab in the SQL
Server Service properties window.
The SQL Service Log On account should have access to the replica instances. To keep things simple, you
can also use the same Service Log On account on all the replica instances.
■ Note Group managed service accounts (gMSA) are now supported in SQL Server 2016. It is recommended
to use gMSA for Always On SQL Server instances. If gMSA cannot be used, then it is recommended to use the
same domain account as the logon account for all the replica SQL server instances.
MSAs are recommended when resources external to the SQL Server computer are needed; otherwise, virtual accounts are recommended on servers running Windows Server 2008 R2 and higher. MSAs are managed domain accounts that provide automatic password management and simplified SPN management, including delegation of management to other administrators within the domain. They are much more secure than regular domain accounts. However, MSAs cannot be shared across multiple hosts; hence they cannot be used on the SQL Servers hosting availability groups. gMSAs, introduced in Windows Server 2012, extend that functionality over multiple servers.
After clicking OK, restart the SQL Server service (during a maintenance window, to avoid service interruption) to enable the Always On Availability Groups feature and to apply the Log On account change.
Repeat the above steps on all the SQL Server Instances that will be participating in the Always On
Availability Groups configuration as replicas.
To enable the Always On feature using PowerShell, either launch it from SSMS as shown in Figure 8-4 or launch it from Windows, navigate to the SQL Server instance, and run the Enable-SqlAlwaysOn cmdlet to enable the feature. PowerShell will restart the SQL Server service itself and enable the feature. Figure 8-4 shows the PowerShell command to enable availability groups.
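A hedged example of the cmdlet (the instance name is from this book's setup; -Force suppresses the confirmation prompt before the service restart):

Import-Module SQLPS
Enable-SqlAlwaysOn -ServerInstance "AlwaysOnN1" -Force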
■ Note A new sample database (WideWorldImporters) has been released for SQL Server 2016. It is
supposed to be an upgrade over the old AdventureWorks database that has been around since SQL Server
2005. The new sample database can be downloaded from here: https://msdn.microsoft.com/library/
mt734199(v=sql.1).aspx
We have used AdventureWorks2016 database for our book as it will still be around. It can be downloaded from
here: https://www.microsoft.com/en-us/download/details.aspx?id=49502
We will be using the Availability Group Wizard to create an availability group. To invoke this wizard, right-click either the Always On High Availability node or the Availability Groups node in SSMS.
Figure 8-6 shows the location in SSMS from where the New Availability Group Wizard can be launched.
An Introduction page appears the first time you run this wizard.
Figure 8-7 shows the introduction page.
To prevent this page from showing up in the future, you can check the "Do not show this page again" check box and then click Next.
On the Specify Availability Group Name page, enter the name of the new availability group. This name
is only to identify the availability group. The clients will not be connecting to this name as they will be
connecting to the listener, which we will be configuring later.
Prior to SQL Server 2016, availability group health only monitored the health of the instance. A database
can be offline or corrupt, but as long as the instance itself is healthy, a failover won’t be triggered. SQL Server
2016 introduces database level health detection that allows you to optionally change the health monitoring
to also consider the health of the databases in the AG.
After choosing the Database Level Health Detection option, click Next.
As can be seen in the Status column, one of the database prerequisites for availability groups has not been met. In this case, a full backup has not been taken. The check box to select the database will be grayed out, and you won't be able to proceed to the next screen unless a full backup is taken.
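For example, a simple full backup along these lines satisfies the prerequisite (the backup path is illustrative):

BACKUP DATABASE [AdventureWorks2016]
TO DISK = N'C:\Backup\AdventureWorks2016.bak'
WITH INIT;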
■ Note For a complete list of all the prerequisites for availability groups, please refer to chapter 6.
Perform the full backup and, on the Select user databases page, click Refresh. The status should now say Meets prerequisites.
Figure 8-10 shows the Select user databases for the availability group page where the prerequisites have
been met.
Click Next.
■ Note In SQL Server 2012 and 2014, encrypted databases could be added to an availability group, but they could not be added using the availability group wizard. Additionally, the encrypted databases could not be accessed in the event of a failover. SQL Server 2016 adds support to the availability group wizard for adding encrypted databases and simplifies database access following a failover. In SQL Server 2016, if you select an encrypted database in the Select Databases page, the wizard detects that the database is defined with a database master key and prompts the administrator for the database master key password. After adding the database, the wizard will verify the password during the validation phase. After you enter all the information and click Finish (Figure 8-24), the wizard creates the credentials on each replica using the password of the database master key.
In the first tab Replicas, add the replicas that will be participating in the availability group. In this case,
AlwaysOnN2 and AlwaysOnN3 instances. Click the Add Replicas button to add new Replicas.
■ Note We will be covering the Add Azure Replica option later in chapter 18.
In the Connect to Server window, provide the SQL instance name that you want to configure as a replica
for the availability group. Click Connect and repeat the Add Replica step for other replica instances in the
availability group. In this case, that would be the AlwaysOnN3 instance.
Figure 8-13 shows the Specify Replicas page with all the replicas added.
Figure 8-13. Specify Replicas page with all the replicas added
■ Note In SQL Server 2016 you can have three replicas configured for Automatic Failover.
• Synchronous-Commit option
Select this check box to configure the replica to use synchronous-commit mode or leave it blank to use asynchronous-commit mode. We discussed the data synchronization modes in detail in chapter 4, Data Synchronization Internals.
• Readable Secondary
Figure 8-14. Specify Replicas page with the options selected in the Replicas tab
By default, port 5022 will be used for the availability group, which is also the default database mirroring
port. You can specify a different port number in the port number column. You can also uncheck the Encrypt
Data check box if you don’t want the data sent over the endpoint to be encrypted.
■ Note Ensure that the endpoint port (5022 in this example) has a firewall exception on all the replicas. If the firewall exception is not there on this port, deployment of an availability group will fail with an error.
It is recommended to use a dedicated network adapter for the availability groups to achieve optimal
performance. We are not using a dedicated network for the availability group setup in this chapter. If you
choose to do this, then make sure that the cluster uses the same network for cluster communication. We will
cover the TSQL commands to configure availability groups over a dedicated network in the TSQL section
later in this chapter.
Figure 8-16 shows the Specify Replicas page with the Backup Preferences tab. The options on this tab let you specify your backup preference for the availability group and your backup priorities for the individual availability replicas.
The options here only matter for automated backups; they do not apply to manual backups. Database maintenance plans and log shipping are aware of these settings and can leverage the preferences selected here. For custom maintenance jobs, the script will have to contain the logic to use the preferences selected here. We will discuss the script and logic to use in detail later in chapter 11.
• Prefer Secondary (Default) - Backups occur on a secondary replica except when the
primary replica is the only replica online.
• Secondary only - Backups always occur on the secondary. Backups will not occur if
the primary is the only replica online.
• Primary - Backups always occur on the primary replica. You might want to choose
this option if differential backups are to be taken. Differential backups are not
supported on secondary replicas.
• Any Replica – Backups can occur on any replica; however backup priority will still be
applicable.
To set the backup priorities, set the values in the replica backup priorities grid. For the backup priority, you can select any integer between 1 and 100. One is the lowest priority, and 100 is the highest priority. To exclude a replica, check the Exclude Replica box for that replica.
Choose the "Do not create an availability group listener now" option if you do not wish to create the listener along with the availability group.
■ Note Ensure that the cluster name object (AlwaysOnCluster in our setup) has the Create Computer Objects and Read All Properties privileges on the organizational unit (AlwaysOnOU in our setup) it currently resides in to be able to create Virtual Computer Objects (AGCorp_Listen in our setup).
To create it along with the availability group, select the "Create an availability group listener" option. Specify the network name in the Listener DNS Name box and the port number in the Port box. The listener name must be unique in the domain. It can be alphanumeric and up to 15 characters long.
■ Note If you choose to use port 1433, you do not need to provide a port number in a connection string. If you
choose a port other than 1433 for your listener port, you will also need to explicitly specify a target port in your
connection string. Also ensure that no other services on the cluster node are using this port; otherwise this would
cause a port conflict. If one of the instances of SQL Server is listening on TCP port 1433 and there are no other
services (including additional instances of SQL Server) on the computer listening on port 1433, this will not cause a
port conflict with the availability group listener. An availability group listener can share the same TCP port inside the
same service process. Multiple instances of SQL Server should not be configured to listen on the same port.
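For example, a client connection string for a listener on a nondefault port in a multi-subnet configuration might look like the following (the listener name, port, and database are illustrative):

Server=tcp:AGCorp_Listen,50123;Database=AdventureWorks2016;Integrated Security=SSPI;MultiSubnetFailover=True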
Click on the Network mode drop-down list to select the network mode to be used by this listener. Select
DHCP if you want dynamic IPv4 address assigned by a server running Dynamic Host Configuration Protocol
(DHCP). DHCP is limited to a single subnet. DHCP is not recommended for production environments. If you
want the listener to listen on more than one subnet, use the static IP network mode option. In this chapter
we will be setting up a multi-subnet availability group, hence static IP was selected.
Next click on the Add button.
Figure 8-18 shows the IP Address window.
In the Subnet drop-down, select the first subnet and provide a new IPv4 address and click OK.
Figure 8-19 shows the Specify Replicas page with the IP Address added for the first subnet.
Next click on the Add button again to add the IP Address for the second subnet. You don’t have to do
this if all the replicas are in the same subnet.
Figure 8-20 shows the IP Address window.
This time select the second subnet in the Subnet drop-down and provide a new IPv4 Address and click OK.
Figure 8-21 shows the Specify Replicas page with both of the IP Addresses for the listener.
Select your preference for initial data synchronization of new secondary databases on this page. The options are Full, Join only, and Skip initial data synchronization.
• Full – If you choose the full option, the wizard will create a full and log backup of the
primary database, create the corresponding secondary databases by restoring these
backups on every server instance that is hosting a secondary replica, and join each
secondary database to an availability group in one workflow. You will need to specify
a network share for the wizard to create and access backups.
■ Note The SQL Server service logon account on the primary replica must have read and write file-system
permissions on the network share and the SQL Server service logon account on the secondary replica must
have read permission on the network share.
• Join only - Select this option if you have already restored the primary database and its log backups on the secondary replicas. The manual restore must be done from a recent database backup of the primary database using RESTORE WITH NORECOVERY, followed by each subsequent log backup restore using RESTORE WITH NORECOVERY. You will have to perform this restore sequence on every SQL Server instance that hosts a secondary replica for the availability group. The wizard will attempt to join each existing secondary database to the availability group.
• Skip initial data synchronization - Select this option if you want to just set up the
availability group but not join any databases. After exiting the wizard, you will have to
perform the database and log backups of every primary database, and restore them
to every SQL Server instance that hosts a secondary replica. Once the databases have
been restored, join every secondary database on every secondary replica.
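For the Join only option, the manual restore sequence on each secondary replica would look similar to
this minimal sketch (the database name and backup paths are assumptions):

-- Run on every SQL Server instance that hosts a secondary replica
RESTORE DATABASE [AdventureWorks2016]
    FROM DISK = N'\\BackupShare\AdventureWorks2016_full.bak'
    WITH NORECOVERY;
RESTORE LOG [AdventureWorks2016]
    FROM DISK = N'\\BackupShare\AdventureWorks2016_log1.trn'
    WITH NORECOVERY;
GO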
After choosing the appropriate option (in this case Full), click Next.
Figure 8-25 shows Part 2 of the Summary page with values provided for the availability group creation.
T-SQL
You can also use the CREATE AVAILABILITY GROUP T-SQL command to create and configure an availability
group. You need to first create a database mirroring endpoint manually using the CREATE ENDPOINT T-SQL
command. You would also need to grant CONNECT permission to the appropriate logins if the SQL Server service
accounts run under different domain accounts. You would also need to start the AlwaysOn_Health
extended event session manually using the ALTER EVENT SESSION command.
There are certain availability group operations that can be performed only via T-SQL as there is no GUI
support at this time. Examples include configuring availability groups over a dedicated network, creating a Basic
Availability Group (BAG), enabling DTC support, enabling automatic seeding on the availability group,
creating a distributed availability group, and configuring an Always On Availability Group without domains.
-- On AlwaysOnN3 (Replica 3)
CREATE ENDPOINT [Hadr_endpoint]
AS TCP (LISTENER_PORT = 5022, LISTENER_IP = (10.1.1.3))
FOR DATABASE_MIRRORING (ROLE = ALL, ENCRYPTION = REQUIRED ALGORITHM AES);
GO
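The remaining manual steps would look similar to the following sketch; the service account login name is
an assumption:

-- Allow the service accounts of the other replicas to connect to the endpoint
GRANT CONNECT ON ENDPOINT::[Hadr_endpoint] TO [CONTOSO\SQLSvcAcct];
GO
-- Start the AlwaysOn_Health extended event session and have it start automatically
ALTER EVENT SESSION [AlwaysOn_health] ON SERVER WITH (STARTUP_STATE = ON);
GO
ALTER EVENT SESSION [AlwaysOn_health] ON SERVER STATE = START;
GO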
■ Note SQL Server 2016 RTM only supports distributed transactions where no two databases are hosted by
the same instance. Also, DTC support cannot be added to an existing availability group; it can only be enabled
when the availability group is created.
Automatic Seeding
When automatic seeding is enabled for an availability group, SQL Server automatically creates the secondary
databases for every replica in the group. That means you no longer have to manually back up the primary
databases and restore them on the secondary replicas.
If you want to enable automatic seeding for an existing availability group, you can run the ALTER
AVAILABILITY GROUP command.
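For example, the following sketch (the replica name is assumed from our setup) switches a replica to
automatic seeding:

ALTER AVAILABILITY GROUP [AGCorp]
MODIFY REPLICA ON N'AlwaysOnN2' WITH (SEEDING_MODE = AUTOMATIC);
GO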
■ Note Open inbound firewall rules to the mirroring endpoint port on each server as automatic seeding
communicates over the mirroring endpoint.
AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
BACKUP_PRIORITY = 50,
SECONDARY_ROLE(ALLOW_CONNECTIONS = NO),
SEEDING_MODE = AUTOMATIC),
N'AlwaysOnN3' WITH (ENDPOINT_URL = N'TCP://AlwaysOnN3.contoso.com:5022',
FAILOVER_MODE = AUTOMATIC,
AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
BACKUP_PRIORITY = 50,
SECONDARY_ROLE(ALLOW_CONNECTIONS = NO),
SEEDING_MODE = AUTOMATIC);
GO
■ Note Data and log file paths should be the same on every SQL Server instance participating in the
availability group. This is a requirement for automatic seeding to work. Also the database should be in full
recovery model and should have a current full and transaction log backup.
3. Next join the secondary server to the availability group and grant availability
group permission to create databases.
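-- Run on the secondary replica that is joining the availability group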
ALTER AVAILABILITY GROUP AGCorp JOIN
GO
ALTER AVAILABILITY GROUP AGCorp GRANT CREATE ANY DATABASE
GO
You can also prevent automatic seeding on the secondary replicas by denying the secondary replica
availability group permission to create databases.
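A sketch of that command, run on the secondary replica:

ALTER AVAILABILITY GROUP AGCorp DENY CREATE ANY DATABASE
GO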
There might be scenarios where you want to stop automatic seeding. Run the following
command on the primary replica to switch the seeding mode to manual.
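ALTER AVAILABILITY GROUP [AGCorp]
MODIFY REPLICA ON N'AlwaysOnN2' WITH (SEEDING_MODE = MANUAL)  -- replica name assumed from our setup
GO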
Be careful when using automatic seeding with very large databases (VLDBs). If a VLDB is added during
business hours, the network load caused by automatic seeding could potentially impact data
synchronization performance. You might want to consider temporarily turning off automatic seeding while
VLDBs are being added to the availability group.
■ Note The LISTENER_URL option specifies the listener for each availability group along with the database
mirroring endpoint of the availability group. Hence the port number 5022 in the example is the mirroring port.
-- A sketch; the opening of the statement is inferred from the surrounding text,
-- with the first availability group name assumed to be AGCorp
CREATE AVAILABILITY GROUP [distributedAGCorp]
WITH (DISTRIBUTED)
AVAILABILITY GROUP ON
'AGCorp' WITH
(
LISTENER_URL = 'tcp://AGCorp_listen.contoso.com:5022',
AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
FAILOVER_MODE = MANUAL,
SEEDING_MODE = AUTOMATIC
),
'AGCorp2' WITH
(
LISTENER_URL = 'tcp://AGCorp2_listen.contoso.com:5022',
AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
FAILOVER_MODE = MANUAL,
SEEDING_MODE = AUTOMATIC
);
GO
Following is the command to fail over to the secondary availability group:

ALTER AVAILABILITY GROUP [distributedAGCorp]
FORCE_FAILOVER_ALLOW_DATA_LOSS
■ Note As a distributed availability group associates two availability groups that have their own listeners, you
could use the old database mirroring connection string syntax for transparent client redirection after failover.
■ Note To help simplify the implementation, the Always On feature Program Manager created
and published two stored procedures (CreateEndpointCert and InstallEndpointCert) in a blog article.
Consider using these two stored procedures if you need to configure a multi-domain or a domain-less
availability group. The stored procedures can be found here: https://blogs.technet.microsoft.com/
dataplatforminsider/2015/12/15/enhanced-always-on-availability-groups-in-sql-server-2016/
PowerShell
PowerShell cmdlets such as New-SqlAvailabilityReplica, New-SqlAvailabilityGroup, Join-
SqlAvailabilityGroup, and Add-SqlAvailabilityDatabase can be used for creating availability groups.
Just as with T-SQL, a database mirroring endpoint needs to be created (using New-SqlHadrEndpoint)
on each replica, and the AlwaysOn_Health extended event session also needs to be started manually.
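A minimal sketch of this workflow using the AGCorp setup from this chapter (the provider paths, endpoint
URLs, and database name are assumptions):

# Create replica templates describing each replica
$replica1 = New-SqlAvailabilityReplica -Name "AlwaysOnN1" `
    -EndpointUrl "TCP://AlwaysOnN1.contoso.com:5022" `
    -AvailabilityMode SynchronousCommit -FailoverMode Automatic `
    -AsTemplate -Version 13
$replica2 = New-SqlAvailabilityReplica -Name "AlwaysOnN2" `
    -EndpointUrl "TCP://AlwaysOnN2.contoso.com:5022" `
    -AvailabilityMode SynchronousCommit -FailoverMode Automatic `
    -AsTemplate -Version 13

# Create the availability group on the primary replica
New-SqlAvailabilityGroup -Name "AGCorp" `
    -Path "SQLSERVER:\SQL\AlwaysOnN1\DEFAULT" `
    -AvailabilityReplica @($replica1, $replica2) `
    -Database "AdventureWorks2016"

# Join the secondary replica to the availability group
Join-SqlAvailabilityGroup -Path "SQLSERVER:\SQL\AlwaysOnN2\DEFAULT" -Name "AGCorp"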
Summary
In this chapter we saw how to successfully configure and deploy availability groups using the availability
group wizard. In the next chapter we will take a look at the post-installation/deployment tasks.
CHAPTER 9
Post-Installation Tasks
After creating Always On Availability Groups in chapter 8, you are now ready to review the resources that
were created by the availability group deployment and to perform the important post-installation tasks. In this
chapter, you will review the newly created availability group, learn about different ways to connect to the
availability group, observe the current health of the availability group, review and modify the availability
group configuration, and replicate the logins and jobs to the secondary replicas.
■ Note An availability group listener allows clients to connect to the primary replica for read-write access
without needing to know the name of the SQL Server instance they are connecting to. Also, when a primary
replica goes offline and a secondary replica takes over the primary role, the availability group listener enables
new connections to automatically connect to the new primary replica. The existing connections are
disconnected, and each client must establish a new connection in order to continue working with the same database.
You will be connected to the primary replica of the availability group. To verify this, open a query
window, execute the command SELECT @@SERVERNAME, and verify that the result displays the name of the
SQL Server instance hosting the primary replica.
If your application uses the ADO.NET or SQL Native Client driver to connect to an availability
database, specify the availability group listener name followed by the availability group listener port in your
application's connection string as shown here:

Server=tcp:AGCorp_Listen,1433;Database=AdventureWorks2016;Integrated Security=SSPI
It is recommended to use the MultiSubnetFailover connection option (if supported by your client
libraries) in the connection string for both single and multi-subnet connections to availability group
listeners, as shown below:

Server=tcp:AGCorp_Listen,1433;Database=AdventureWorks2016;Integrated Security=SSPI;MultiSubnetFailover=True
■ Note Starting from .NET Framework 4.6.1, the MultiSubnetFailover property is set to true by default.
Although the MultiSubnetFailover connection option is not required, it provides faster subnet
failover because the client driver attempts to open a TCP socket for each IP address associated with the
availability group listener in parallel. The client driver waits for the first IP address to respond with success
and uses it for the connection. Using the MultiSubnetFailover connection option in a single-subnet
configuration allows you to preconfigure new clients to support future spanning of subnets without any need
for future connection string changes.
■ Note In a multi-subnet cluster setup, if your application is using a legacy client that does not support the
MultiSubnetFailover property, then consider setting the listener’s RegisterAllProvidersIP value to 0 to
avoid intermittent timeout issues.
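A sketch of that change using the failover clustering PowerShell module; the listener's cluster resource
name is an assumption (check Failover Cluster Manager for the actual name):

# Register only the online IP address in DNS instead of all listener IPs
Get-ClusterResource "AGCorp_AGCorp_Listen" |
    Set-ClusterParameter RegisterAllProvidersIP 0

# Restart the listener resource for the change to take effect
Stop-ClusterResource "AGCorp_AGCorp_Listen"
Start-ClusterResource "AGCorp_AGCorp_Listen"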
You will see all the availability groups running on this SQL Server instance. Figure 9-2 shows only one
availability group named AGCorp (Primary) as you created only one availability group in chapter 8. Notice
it has the availability replica role in parentheses after the availability group name. Expand the availability
group to see availability replicas, availability databases, and availability group listeners. Expand Availability
Replicas to see all the SQL Server instances participating in the availability group. In Figure 9-2, there are
three availability replicas: AlwaysOnN1 is the current primary replica, and AlwaysOnN2 and AlwaysOnN3
are the secondary replicas. Expand Availability Databases, and you will see the availability databases that
are being replicated to the replicas. Figure 9-2 shows two availability databases AdventureWorks2014 and
AdventureWorksDW2014. These are the two databases that you selected while creating availability groups in
chapter 8. Expand Availability Group Listeners to see the availability group listener. Also, if you expand the
Databases node you will see the two availability databases along with their synchronization state.
A client connection can use the SQL Server instance name directly instead of using the availability
group listener name. The SQL Server instance does not care whether a client connects using the availability
group listener name or another instance endpoint. The SQL Server instance verifies the state of the targeted
database and either allows or disallows connections based on the availability group configuration. Here is a
sample connection string to connect directly using the SQL Server instance name (AlwaysOnN1):

Server=tcp:AlwaysOnN1,1433;Database=AdventureWorks2016;Integrated Security=SSPI
■ Note If you choose to connect directly using the SQL Server instance name and not the availability group
listener name, you will lose the benefit of new connections being directed automatically to the current primary
replica. Also, you will lose the benefit of read-only routing, which we will discuss in chapter 10.
■ Note If you open the availability group properties from a secondary replica, you will only be able to review
the properties and will not be able to modify them.
In the availability group properties window, most of the columns are self-explanatory. Notice the
column called Session Timeout (seconds). The session timeout (default value is 10 seconds) controls how
long an availability replica waits for a ping response from a connected replica before considering the
connection to have failed. If a replica receives a ping within the session timeout period, then it means
that the connection is still active and the replicas are still communicating. If no ping is received within the
session timeout period, the replica times out. Its connection is closed and the timed-out replica enters the
DISCONNECTED state. This replica property applies only to the connection between a given secondary
replica and the primary replica of the availability group.
■ Note The default session timeout period of 10 seconds works well for most environments. It is
recommended to set this value to 10 seconds or greater. If you set it to less than 10 seconds, you risk
declaring a false failure when the replica is merely busy.
3. This will open the Always On Dashboard on the SQL Server instance. The
dashboard will show you the names of all the availability groups configured on
the SQL Server instance, the primary instance name, the failover mode, and
issues (if any) as shown in Figure 9-6. Figure 9-6 displays one availability group
named AGCorp hosted on primary instance AlwaysOnN1 and configured with
“Automatic” failover mode. This is the availability group that you created in
chapter 8. If you had multiple availability groups on this instance, then you
would have seen them here too.
4. To see the availability group details, click the availability group name. Alternatively,
you can see the availability group details from the Object Explorer by expanding
the Always On High Availability node, expanding Availability Groups node,
right-clicking the availability group name, and then clicking Show Dashboard as
shown in Figure 9-7.
5. This will open the availability group details dashboard as shown in Figure 9-8.
Figure 9-8. Sample availability group details dashboard launched from the primary replica
■ Note You can group the availability group information by clicking Group by and selecting one of the
following: Availability replicas, Availability databases, Synchronization state, Failover readiness, Issues or None.
Apart from the above three main sections, the dashboard has three links on the top right-hand corner:
• Start Failover Wizard – As the name suggests, this link invokes the Failover Availability
Group Wizard. This wizard allows us to perform a planned failover, failing the availability
group over to a secondary replica and making it the new primary replica.
• View Always On Health Events – This link allows us to view the Extended Events
captured via the AlwaysOn_Health session. A sample Always On Health Events
screen is shown in Figure 9-9. In this screen, you can view the recent availability
group activity, state changes, and events. You can add columns by right-clicking
the column headers and then selecting what you need. Also, when you click a
specific row, you can see the details of that row below the grid.
• View Cluster Quorum Information – This link displays the Windows Cluster Quorum
information as shown in Figure 9-10. This screen tells us the Windows Server
Failover Cluster (WSFC) name, Quorum model, member name, member type,
member state, and vote count. By default, all cluster members can vote and have
a vote count of 1. Administrators can change the vote functionality in the quorum
model by configuring a node to have 0 votes. For example, say you have a node
in a disaster recovery datacenter and you have poor network connectivity to that
datacenter. To ensure the node in the DR datacenter does not participate in the
quorum voting, you can configure its vote to 0 using the Failover Cluster Manager or
PowerShell.
So far we have discussed the default view of the Always On Dashboard. The dashboard is highly
configurable, and there are many settings that you can add to the view to provide you with even more
information on the current state of the availability group. You can add additional columns by right-clicking
the column headings or by clicking the Add/Remove Columns. Figure 9-11 shows the additional column for
the Availability Replica Health section.
For example, one column that you may want to add is the Availability Mode column, which tells us whether
the availability mode is synchronous-commit or asynchronous-commit. Or you may want to add the
Secondary Connection Mode column, which tells us the connection access allowed to the secondary replica.
■ Note The dashboard remembers the columns that you add/remove manually from the Add/Remove
Columns list. Check it out by adding some columns manually and then closing and reopening the dashboard. You
should see all the columns that you had added manually. If you want to revert to the default columns,
select Reset View.
Figure 9-12 shows the additional column for the Availability Database Health section.
Say your boss comes up to you and asks, “If we lose our primary replica or databases, how much
data will we lose and how much time will it take us to recover?” Or maybe you are asked, “If the primary
replica goes down, how many seconds or minutes’ worth of data will we lose at this time?” Or, if the primary
replica goes down and automatic failover happens, how much time will the secondary replica take to finish
the recovery and make the databases available for use? Without writing a single line of code, you can find
the answer almost instantly and impress your boss by opening the availability group dashboard and
manually adding the Estimated Recovery Time (seconds) and Estimated Data Loss (time) columns. You will
be able to see this information for every individual availability database on every replica, so you can quickly
provide the information to your boss. Most of the columns are self-explanatory, but do not worry too much if
you are not familiar with them. We will cover the most important columns in chapter 15.
It is important to note that the dashboard shows a different view based on where you launch it. The
dashboard shown in Figure 9-8 was launched from the primary replica. From this view, you can monitor the
health of the whole availability group. If you launch the dashboard from a secondary replica, then the view
only shows the health of that secondary replica. Figure 9-13 shows the dashboard when launched from the
secondary replica AlwaysOnN2. As can be seen in Figure 9-13, you are only getting the information from the
perspective of the secondary replica AlwaysOnN2. Even if you open the dashboard from a secondary replica,
the dashboard does show which SQL Server instance is our primary instance. For example, Figure 9-13
shows that AlwaysOnN1 is the primary instance. If you click AlwaysOnN1 on the dashboard, it will show you
the full dashboard as in Figure 9-8.
Figure 9-13. Sample availability group details dashboard launched from the secondary replica
■ Tip To ensure that you launch the dashboard from the primary replica, connect in SQL Server Management
Studio using the availability group listener name, and then invoke the dashboard.
2. In the left-hand pane of the Options screen, expand SQL Server Always On, and
then expand Dashboard.
3. As can be seen in Figure 9-14, by default Turn on automatic refresh is selected
and the default refresh interval is 30 seconds. To change the default settings,
enter the new refresh interval in seconds and optionally change the number of
times you want to retry the connection. Typically, I change the refresh interval
from 30 seconds to 10 seconds or lower when I am monitoring the availability
groups. If you want to enable user-defined Always On policy, select Enable
user-defined Always On policy. Click OK to apply the changes.
■ Note The Dashboard options are client-specific configuration options and will only apply to the current
user’s SQL Server Management Studio.
Notice that the name AGCorp_Listen that you typed while creating the availability group listener in
chapter 8 has been added to the AGCorp role as a cluster resource. Expand the AGCorp_Listen resource to see
the dependent IP address resources that you specified during the availability group listener creation.
Notice that the IP address 20.1.1.105 is offline and 10.1.1.105 is online because at any point in time only one
of the IP address resources can be online. Currently, AGCorp is owned by the AlwaysOnN1 node residing in the
primary datacenter, and hence the IP address 10.1.1.105 is online. If AGCorp had failed over to the disaster
recovery datacenter and was owned by AlwaysOnN3, then the IP address 20.1.1.105 would come online and
10.1.1.105 would go offline. If all the replicas are in the same subnet, only one IP address resource
will exist and it will be online all the time.
Right-click the SQL Server Availability Group Resource AGCorp, select More Actions, and then select
Show Dependency Report to see the dependency report for the availability group as shown in Figure 9-16.
■ Note The flexible failover policy of an availability group can be modified from Failover Cluster Manager,
PowerShell, or Transact-SQL.
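For example, a Transact-SQL sketch (the values shown are the defaults):

ALTER AVAILABILITY GROUP AGCorp SET (HEALTH_CHECK_TIMEOUT = 30000);
GO
ALTER AVAILABILITY GROUP AGCorp SET (FAILURE_CONDITION_LEVEL = 3);
GO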
The availability group resource DLL performs a health check of the primary replica by calling the
sp_server_diagnostics stored procedure with a repeat interval. It captures diagnostic data and health
information of the primary replica and returns results at an interval that equals one-third of the HealthCheck
Timeout period. The sp_server_diagnostics stored procedure collects the data on the following five
components: system, resource, query_processing, io_subsystem, and events. It also provides the health state
of each component as clean, warning, error, or unknown. Additionally, on the primary replica, sp_server_
diagnostics displays the availability group name and whether the lease is valid or not for each availability
group. Table 9-1 has the description of the five components:
• system – Collects data from a system perspective on spinlocks, severe processing conditions,
non-yielding tasks, page faults, and CPU usage.
• resource – Collects data from a resource perspective on physical and virtual memory, buffer
pools, pages, cache, and other memory objects.
• query_processing – Collects data from a query-processing perspective on the worker threads,
tasks, wait types, CPU-intensive sessions, and blocking tasks.
• io_subsystem – Collects data on the IO subsystem.
• events – Collects data on the errors and events of interest recorded by the server, including
details about out of memory, scheduler monitor, buffer pool spinlocks, security,
connectivity, ring buffer exceptions, and ring buffer events about the memory broker.
■ Note Even though the sp_server_diagnostics stored procedure collects data for five components, only
the data for system, resource, and query_processing is currently used for failure detection, whereas the
io_subsystem and events data are available for diagnostic purposes only.
Based on the issues reported in the result sets of the sp_server_diagnostics stored procedure
and the FailureConditionLevel, the resource DLL decides whether the primary replica is healthy or not and
reports back to the WSFC cluster. If the primary replica is deemed unhealthy, the WSFC cluster then initiates
an automatic failover to the secondary replica.
When an availability group failover occurs, existing connections to the primary replica are terminated,
and the client applications must establish a new connection in order to continue working. While a failover
is occurring, connectivity to the availability group databases will fail, forcing the client applications to retry
connecting until the primary is fully online and ready to accept connections. If the availability group comes
back online during the client’s connection attempt but before the connection timeout period, the client may
successfully connect during one of the retry attempts and no error will be reported to the application.
As seen in Figure 9-17, the default HealthCheck Timeout is 30 seconds. Hence, the sp_server_
diagnostics stored procedure returns results at a 10-second interval. If the primary replica is unresponsive
and not a single result set has been received by the resource DLL during the 30 second period, the server is
considered to be hung.
The FailureConditionLevel determines whether the diagnostic information returned by sp_server_
diagnostics warrants an automatic failover; it specifies which failures trigger an
automatic failover. There are five failure condition levels, 1 through 5. The checks are cumulative, with 1
being the least restrictive and 5 being the most restrictive; each level includes all the checks
of the lower levels plus its own. Table 9-2 shows each FailureConditionLevel and its description.
The default failover condition level is 3, which means that it checks if the SQL Server service is up,
whether the sp_server_diagnostics stored procedure responds within the health check timeout, and
whether there are critical server errors reported in the result set.
■ Note The failure condition level and sp_server_diagnostics only monitor the health of the SQL Server
instance and not the health of the database. However, starting from SQL Server 2016, if you selected Database
Level Health Detection while creating the availability group (or added the detection after the creation of the
availability group), then an automatic failover is initiated when the database status on the primary replica is
anything other than ONLINE. If you did not select Database Level Health Detection, then only the health of the
instance is used to trigger automatic failover.
Figure 9-17 has another property called the Lease Timeout, which is set to a default value of
20,000 ms, or 20 seconds. The Lease Timeout is used to avoid split-brain scenarios in an availability group. A
split-brain scenario occurs if more than one availability replica considers itself to be the primary replica, tries
to bring the availability group online, and thereby accepts and responds to queries. This typically happens
when the cluster nodes are disconnected. To understand this better, let's consider a scenario in which the SQL
Server instance running the primary replica of an availability group is detected as unresponsive and a failover is
initiated. However, the resource DLL is no longer able to connect to the unresponsive SQL Server instance.
Therefore, that SQL Server instance would think that it is still the primary replica and would keep accepting
and responding to queries. To avoid such a scenario, the SQL Server instance running the primary replica needs
to renew a lease with the resource DLL before the lease timeout threshold is reached. If the lease is not renewed
within that threshold, the replica is demoted from the primary role and a failover to a secondary is initiated.
The number of failovers the cluster allows is governed by the failover policy of the availability group
role: the maximum failures allowed in a specified period. The default time period is 6 hours, and the maximum
failures supported is N-1, where N is the number of cluster nodes. Since we have a 3-node WSFC, the maximum
failures allowed is (3 - 1) = 2 in 6 hours. This means that if you are testing automatic failover on our 3-node
cluster, you will be able to perform automatic failover only twice in 6 hours. The third time, the availability
group will remain in a failed state until a cluster administrator manually brings the failed availability group
resource online or a DBA performs a manual failover of the availability group.
Replicating Logins
The steps to replicate the logins to all replicas are the same ones you have always used to transfer logins
from one SQL Server to another. There are various methods to replicate the logins, but one of the best
is to use the T-SQL script in the Microsoft Knowledge Base article titled “How to transfer logins and passwords
between instances of SQL Server,” available at https://support.microsoft.com/en-us/kb/918992. Here
are the high-level steps that you need to follow for availability groups:
1. On the primary replica, run the script in step 2 of “Method 3: Create a log in
script that has a blank password” in the above Microsoft Knowledge Base article.
2. This script will create two stored procedures in the master database. The stored
procedures are named sp_help_revlogin and sp_hexadecimal.
3. On the primary replica, execute the stored procedure sp_help_revlogin. This will
create the login script.
4. Review the login script that was generated in step 3 and execute it against all the
secondary replicas.
5. The login script creates the logins that have the original Security Identifier (SID)
and the original password.
Another method to replicate the logins to secondary replicas is to use the SQL Server Integration
Services (SSIS) – Transfer Logins Task. The Transfer Logins Task can be configured to transfer all the logins,
only specified logins, or all the logins that have access to specified databases only.
■ Note If you are using the Transfer Logins Task, the transferred logins are disabled and assigned random
passwords at the secondary replicas. A database administrator (DBA) who has membership in the sysadmin role
on the secondary replicas needs to change the passwords and enable the logins before they can be used.
If you only have a few logins, you could create the logins on the secondary replicas using the T-SQL
CREATE LOGIN command.
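When doing so, create each login on the secondary with the same SID it has on the primary so that the
database users map correctly after a failover. A sketch, with a hypothetical login name, password, and SID:

-- On the primary replica, retrieve the login's SID
SELECT name, sid FROM sys.server_principals WHERE name = N'AppLogin';
GO
-- On each secondary replica, create the login with that same SID
CREATE LOGIN [AppLogin]
WITH PASSWORD = N'S@mpleP@ssw0rd!',
SID = 0x14AA10F52B0B4CEB8A7B7BB1A613AB2F;
GO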
■ Note Partially contained databases do not store the jobs. So you still need to replicate the jobs to all the
secondary replicas.
■ Note Backup jobs created using the built-in Maintenance Plan Wizard and Log Shipping Wizard
automatically calls and checks the sys.fn_hadr_backup_is_preferred_replica function.
■ Note SQL Server Integration Services (SSIS) – Transfer Jobs Task can also be used to transfer one or more
SQL Server Agent jobs between the SQL Server instances.
Summary
In this chapter, we covered the post-installation tasks and ensured that everything was configured properly.
Now you are ready to explore the benefits of availability groups. In the next chapter, we will discuss the
Always On Availability Groups Readable Secondary feature, which allows read-only access to the secondary
databases. We will learn the benefits, configuration, limitations, and performance considerations of using
readable secondaries.
PART IV
Now that we have created an Availability Group and made sure it’s configured correctly, let’s take a look at
how to fully leverage the features provided by this technology within SQL Server. The secondary replicas
participating in the Availability Group maintain redundant copies of the production data on the primary
replica. Wouldn’t it be nice to leverage the redundant copies and the processing power on the secondary
replicas to offload some of the production workload? The readable secondary feature within Always On
Availability Groups allows you to do that.
It allows you to leverage the investment in highly available hardware by offloading read workloads such
as reporting to one or more secondary replicas. Offloading reporting workloads to secondary replicas frees
up resources on the primary replica to achieve higher throughput, while at the same time allowing the
resources on the secondary replicas to deliver higher performance for the reporting workload.
Whenever we are overloaded with work, we usually wish we could clone ourselves and offload some of
our work to our clones. We would be more effective and would be able to focus on the important tasks rather
than trying to do everything ourselves. But this is just wishful thinking.
The world is becoming a 24/7 global marketplace. Sharing the same servers/databases between the primary
and reporting workloads takes CPU cycles and IO bandwidth away from the main workload. Sharing of
workloads can also cause contention, impacting the overall performance and availability of the applications.
In addition, server and database maintenance is critical for maintaining the health of the database servers
and protecting the data from unexpected failures. But due to the 24/7 global marketplace, maintenance
windows are shrinking day by day, making it difficult for administrators to perform maintenance without
impacting performance and availability.
Figure 10-1. Offload reporting workload options prior to SQL Server 2012
■ Note SQL Server 2012 supports one primary replica and up to four secondary replicas. SQL Server 2014
and 2016 support one primary replica and up to eight secondary replicas.
As shown in Figure 10-2, you can configure multiple secondary replicas, each of which can be used
for read-only workloads and/or backups and/or DBCC integrity checks. A secondary can be in either
synchronous-commit or asynchronous-commit mode when used as an active secondary.
■ Note We cover the Secondary Replica on Microsoft Azure later in chapter 18.
You can change the connection access for the secondary role by selecting the Yes, No, or Read-intent only
options, as shown in the diagram. We discussed these options in detail in chapter 8.
Using Transact-SQL
You can also configure the connection access for the secondary role by using the CREATE/ALTER AVAILABILITY
GROUP commands. The SECONDARY_ROLE option in the ADD/MODIFY REPLICA clause takes the NO, ALL, or
READ_ONLY keywords, which equate to the No, Yes, or Read-intent only options respectively.
Figure 10-4 shows the T-SQL command to configure connection access for the secondary role.
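A sketch of that command for our setup:

ALTER AVAILABILITY GROUP AGCorp
MODIFY REPLICA ON N'AlwaysOnN2' WITH
(SECONDARY_ROLE (ALLOW_CONNECTIONS = READ_ONLY));
GO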
Using PowerShell
If creating a new replica, use the New-SqlAvailabilityReplica cmdlet; when modifying an existing replica,
use the Set-SqlAvailabilityReplica cmdlet. To configure connection access for the secondary role, specify the
ConnectionModeInSecondaryRole parameter with one of the following values:
AllowNoConnections, AllowReadIntentConnectionsOnly, or AllowAllConnections
Figure 10-5 shows the PowerShell command to configure connection access for the secondary role.
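A sketch of that command; the provider path is an assumption based on our setup:

Set-SqlAvailabilityReplica -ConnectionModeInSecondaryRole AllowReadIntentConnectionsOnly `
    -Path "SQLSERVER:\Sql\AlwaysOnN1\DEFAULT\AvailabilityGroups\AGCorp\AvailabilityReplicas\AlwaysOnN2"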
Here is a sample connection string that requests read-intent access:

Server=tcp:AGCorp_Listen,1433;Database=AdventureWorks2016;Integrated Security=SSPI;ApplicationIntent=ReadOnly
Figure 10-6 shows the output from the preceding query. As you see, there is no read-only routing
currently set up for any of the replicas.
■ Note There is no GUI available in SSMS to accomplish this task; it can only be accomplished through
ALTER AVAILABILITY GROUP DDL scripts.
The first step is to configure the read-only routing URL for each replica. The read-only routing URL is used
for routing read-intent connection requests to a specific readable secondary replica. A routing URL consists
of the system address of the replica and the port number used by the database engine of that instance.
For example, the routing URL for AlwaysOnN1 would be tcp://alwaysonn1:1433. Note that since we are
specifying the port number used by the database engine, we do not specify the name of the SQL Server
instance, only the server name. Usually, every readable secondary replica is assigned a read-only routing
URL. To configure the read-only routing URLs, connect to the primary replica in the script window and execute
the following code. In the example below we are configuring the routing URL for all the replicas in the
availability group.
USE MASTER
GO
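-- A sketch assuming each instance listens on the default port 1433
ALTER AVAILABILITY GROUP AGCorp
MODIFY REPLICA ON N'AlwaysOnN1' WITH
(SECONDARY_ROLE (READ_ONLY_ROUTING_URL = N'tcp://alwaysonn1:1433'));
ALTER AVAILABILITY GROUP AGCorp
MODIFY REPLICA ON N'AlwaysOnN2' WITH
(SECONDARY_ROLE (READ_ONLY_ROUTING_URL = N'tcp://alwaysonn2:1433'));
ALTER AVAILABILITY GROUP AGCorp
MODIFY REPLICA ON N'AlwaysOnN3' WITH
(SECONDARY_ROLE (READ_ONLY_ROUTING_URL = N'tcp://alwaysonn3:1433'));
GO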
■ Note You must set the read-only routing URL before configuring the read-only routing list.
■ Note Prior to SQL Server 2016, read-only routing always directed traffic to the first available replica in
the routing list, unless it was not accessible, and then it would direct the connection to the next replica in the
routing list. It wasn’t possible to spread out the load across replicas, even when you had multiple secondaries.
USE MASTER
GO
-- Configure routing list on current primary replica
ALTER AVAILABILITY GROUP AGCorp
MODIFY REPLICA ON 'AlwaysOnN1' WITH
(primary_role(read_only_routing_list=(('AlwaysOnN2','AlwaysOnN3'),'AlwaysOnN1')))
The nested parentheses that surround the servers identify the load-balanced set.
Basically there are two lists in the preceding example:
List 1: AlwaysOnN2 and AlwaysOnN3
List 2: AlwaysOnN1
Routing logic
The read-only connections will be routed to the replicas in the first list AlwaysOnN2 and AlwaysOnN3.
There will be a round-robin distribution of read-only connections between the two replicas in the first
list. In the preceding example, the first incoming read-only connection will be routed to AlwaysOnN2, the
second read-only connection will be routed to AlwaysOnN3, the third read-only connection will be routed to
AlwaysOnN2, the fourth read-only connection will be routed to AlwaysOnN3, and so on.
If a replica becomes unavailable, routing will continue with the remaining replicas in the first list. In the
preceding example, if AlwaysOnN2 becomes unavailable, then the read-only connections will only be routed
to AlwaysOnN3. No read-only connections will be routed to the primary (AlwaysOnN1) as long as at least one
readable secondary replica in the routing list is accessible. If all replicas in the first list are unavailable, then
the read-only connections are routed to the replicas in the next list. In the preceding example, if AlwaysOnN2
and AlwaysOnN3 become unavailable, the read-only connections will be routed to AlwaysOnN1.
If any replicas in the first list become available again, read-only connections will once again be
routed to those replicas, as they have higher priority.
Make sure you configure the routing list on all the replicas, as any of them could assume the primary role
after a failover; a sketch mirroring the pattern above follows.
USE MASTER
GO
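-- A sketch; the list members are assumptions following the pattern above
ALTER AVAILABILITY GROUP AGCorp
MODIFY REPLICA ON 'AlwaysOnN2' WITH
(primary_role(read_only_routing_list=(('AlwaysOnN1','AlwaysOnN3'),'AlwaysOnN2')));
ALTER AVAILABILITY GROUP AGCorp
MODIFY REPLICA ON 'AlwaysOnN3' WITH
(primary_role(read_only_routing_list=(('AlwaysOnN1','AlwaysOnN2'),'AlwaysOnN3')));
GO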
Now execute the following query again to check if a routing list has been configured correctly for your
Availability Group:
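-- A sketch; the column aliases match the names referenced in the text below
SELECT ar.replica_server_name,
       rl.routing_priority,
       ar2.replica_server_name AS secondary_route_reader_server,
       ar2.read_only_routing_url
FROM sys.availability_read_only_routing_lists rl
JOIN sys.availability_replicas ar ON rl.replica_id = ar.replica_id
JOIN sys.availability_replicas ar2 ON rl.read_only_replica_id = ar2.replica_id
ORDER BY ar.replica_server_name, rl.routing_priority;
GO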
Figure 10-7 shows the output from the preceding query. Review the secondary_route_reader_server
and routing_priority column values.
The read operations do not take shared locks because of row versioning; however, they do take schema
stability (Sch-S) locks. Redo operations applying DDL changes can therefore get blocked. Hence, monitor
your secondary replica for resource contention issues and provision adequate resources on the secondary
replica. Extended Events can be used to monitor REDO thread blocking; the sqlserver.lock_redo_blocked
event is raised when the REDO thread is blocked. Ensure that you are using best practices while building
the reporting queries to minimize resource consumption. Also consider running or scheduling the resource-
intensive queries during low-activity periods.
Query Performance
Statistics on the primary databases are available on the secondary databases. However, if a read-only
secondary database needs statistics that are different from the ones on the primary database, those
statistics cannot be created manually because the database is read-only. In this case, SQL Server creates
temporary statistics in TempDB on the secondary replica. These temporary statistics have readonly_database_
statistic appended to their names. They are also generated if the permanent statistics are stale. If the
permanent statistics are updated on the primary, they are copied over to the secondary and SQL Server
uses the updated permanent statistics rather than the temporary statistics. Temporary statistics are dropped
if the availability group fails over or if SQL Server is restarted.
Indexes required for the read-only workload on the secondary need to be created on the primary
database.
Keep these indexing and statistics considerations in mind while planning the read-only workload or
when you are troubleshooting any read-only workload performance issues on the Readable Secondary.
Data Latency
When the log records are sent from the primary replica to the secondary replicas, a dedicated REDO
thread applies those log records on each secondary database. If you query a secondary database, the
committed data is only available once the REDO thread has applied the logs to that database. In this case,
the Always On dashboard would show zero data loss under the Estimated Data Loss column, which is an
accurate representation as the data is committed. However, a common misconception is that the data on
the secondary replica, when queried, will be up to date if the dashboard shows zero data loss. This is not the
case. Hence, offloading your reporting workloads to a readable secondary is only useful if the application
can tolerate some amount of data latency. If not, it might be better to run the reporting queries against
the primary replica. The latency between the primary and secondary replicas is usually only a matter of
seconds, but it can increase if there is resource contention.
In the case of In-Memory OLTP tables, when these tables are accessed on a secondary replica, a safe
timestamp is used to return rows from transactions that committed earlier than the safe timestamp.
Garbage collection uses this as the oldest-timestamp hint to garbage-collect the rows on the primary replica.
If transactional activity on the primary replica stops before the internal threshold for the safe-timestamp
update is reached, the changes made since the last update to the safe timestamp will not be visible on the
secondary replica. You may need to run a few DML transactions on a dummy durable memory-optimized
table on the primary replica to work around this issue, or run a manual checkpoint to force shipping of a
safe timestamp, although the latter is not recommended.
Resource Governance
To manage the CPU, memory, and I/O resources used by connections from the reporting workload, use the
Resource Governor feature. This will ensure that enough resources are available for the REDO thread and
hence reduce contention between the REDO thread and the reporting workload.
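A minimal sketch, configured on the readable secondary instance (Resource Governor is instance-level;
the pool, group, function, and application names are assumptions):

-- Create a resource pool and workload group for the reporting workload
CREATE RESOURCE POOL ReportingPool WITH (MAX_CPU_PERCENT = 50);
GO
CREATE WORKLOAD GROUP ReportingGroup USING ReportingPool;
GO
-- Classifier function (created in master) that routes the reporting application
USE master;
GO
CREATE FUNCTION dbo.fnClassifyReporting() RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    RETURN CASE WHEN APP_NAME() = N'ReportApp'
                THEN N'ReportingGroup' ELSE N'default' END;
END;
GO
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.fnClassifyReporting);
ALTER RESOURCE GOVERNOR RECONFIGURE;
GO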
Summary
In this chapter we saw how to successfully configure a readable secondary, along with some considerations
and limitations of using one. In the next chapter we will take a look at database maintenance operations
using readable secondaries.
CHAPTER 11
Database Maintenance Using Secondary Replicas
Now that you have deployed your high availability and disaster recovery (HADR) solution using Always On
Availability Groups, it's important to ensure that you are maximizing the investment made in the solution by
fully utilizing all the replicas. Unfortunately, we still find database administrators (DBAs) using the primary
replica for almost everything and rarely using the secondary replicas for production, reporting, and/or
database maintenance workloads. It's very common to see DBAs using the secondary replicas only for DR,
just like they did with older HADR solutions. Most of the older solutions did not allow use of the secondary
replicas for anything except DR, and one of the asks from customers was the ability to fully utilize the
secondaries.
In chapter 10, you saw how you can leverage the investment in the hardware by offloading reporting
workload to one or more secondary replicas thereby freeing up resources on the primary replica to achieve
higher throughput. In this chapter you will see how you can further free up resources on the primary replica
by offloading resource intensive jobs like database backup and integrity check to the secondary replicas.
■ Note A copy-only full backup is a SQL Server backup that is independent of the sequence of regular
SQL Server backups. It does not impact the log chain or clear the differential bitmap. Copy-only full backups
cannot be used as a base for differential backups; however, a copy-only full backup can still be used as a base
for subsequent transaction log backups.
When a secondary replica starts a backup, it notifies the primary replica that it wants to start a backup.
The primary replica will attempt to take a BulkOp lock for that database. The BulkOp lock prevents two
replicas from taking a backup at the same time. Assuming no other replica is taking a backup, the primary
replica will get the BulkOp lock on the database on behalf of the secondary replica and notify the secondary
replica. The secondary replica will then start performing the backup locally, copying data out of the secondary
database or transaction log (depending on the type of backup) to the backup media. There is no interaction
with the primary replica for backup purposes during this stage.
After the secondary replica completes the backup, it notifies the primary replica and gives it the last log
sequence number (LSN) that was active in the log at the point it stopped copying data from the transaction
log. The primary replica updates the last LSN in its copy of the database, which is the updatable copy, and
that’s the one that is used for the next backup. The update of the last backup LSN now gets replicated to
all the secondary replicas. At this point, the primary replica also releases the BulkOp lock so other backup
operations can take place. Also, now each replica can perform log truncation based on this last backup LSN.
Let’s assume that currently AlwaysOnN1 is the primary replica and AlwaysOnN2 and AlwaysOnN3 are
the secondary replicas. AlwaysOnN1 and AlwaysOnN2 are configured with synchronous availability mode
whereas AlwaysOnN3 is configured as asynchronous availability mode. To minimize the impact of the
backups on the primary, you want to take backups on the secondary replica preferably AlwaysOnN2 as it is
in the same datacenter and is in synchronous availability mode.
The first transaction log backup is taken on secondary replica AlwaysOnN2. Let’s assume that this log
backup (Log Backup1) includes transactions with log sequence numbers (LSNs) from 1 to 10. The next
transaction log backup (Log Backup2) occurs on AlwaysOnN2. This backup will include any transactions
with LSN after the last LSN from Log Backup1 taken on AlwaysOnN2. Let’s assume this second log backup
includes transactions with LSNs 11 to 24.
Let’s assume that AlwaysOnN2 is down for some reason and the next log backup (Log Backup3) is
taken on our secondary replica AlwaysOnN3 located in the DR datacenter. Log Backup3 will have
LSNs from 25 onwards; let’s assume it includes transactions with LSNs 25 to 32. Now say
AlwaysOnN3 also goes down and you take the next transaction log backup (Log Backup4) on the primary
replica AlwaysOnN1. This backup will have LSNs from 33 onwards; let’s say it has LSNs 33 to 40.
By this time, let’s assume AlwaysOnN2 and AlwaysOnN3 are back online. And you resume taking
transaction log backup on AlwaysOnN2. The fifth log backup will include transactions from 41 onwards. As
you may have noticed, there are no overlaps of LSNs between the log backups. If you need to recover the
database using the backups, you will need to restore a full backup (or a copy-only full backup) followed by all
the transaction log backups taken from all the replicas and restored in order as if they were all taken from a
single SQL Server.
■ Note You can configure the backup preferences and priorities either during creation of the availability
group, as you saw in chapter 8, or after creating the availability group. After configuring the backup preferences
and priorities, you need to script the backup jobs (discussed later in this chapter) to account for them, as there is
no automatic enforcement of these settings. Ad hoc backup jobs do not check these preferences by default.
However, backup jobs created using the default SQL Server maintenance plan and Log Shipping check the
backup settings that you configure.
Using SSMS
To configure or review the backup preferences and priority using SSMS, open Availability Group Properties
and select the Backup Preferences page as shown in Figure 11-3. This page is very similar to the Backup
Preferences tab that you saw in chapter 8 while creating the availability group.
As shown in Figure 11-3, you have the following options for backup preference:
• Prefer Secondary – This is the default backup preference. As the name suggests, when
this option is selected, backups will occur on a secondary replica except when the
primary replica is the only replica online. When there are multiple secondary replicas
available, backup priority defines the secondary with highest priority for backups.
• Secondary only – If you do not want backups to occur on the primary replica, select
this option. When this option is selected, backups always occur on a secondary and
will not occur at all if the primary is the only replica online.
• Primary – As the name suggests, when this option is selected, backups always occur
on the primary replica. You want to choose this option if differential backups are to
be taken as differential backups are not supported on secondary replicas.
• Any Replica – When this option is selected, backups can occur on any replica;
however backup priority will still be applicable.
Now let’s discuss how to handle backups when there are multiple secondary replicas. For example,
the availability group that you created in chapter 8 has two secondary replicas. If you select the option
Prefer Secondary, which replica will be selected for the backup operation? To specify where the backup
jobs should be performed relative to other replicas in the same availability group, you will set the backup
priorities in the Replica backup priorities grid in the Backup Preferences page. For the backup priority,
you can select any integer between 1 and 100. One has the lowest priority, and 100 has the highest
priority. If the secondary replicas have the same priority, then the preferred replica is selected by ordering
the replicas by their name in ascending order.
To exclude a replica, you can check the Exclude Replica box for that replica. For example, say you
have a replica in the DR datacenter and you want to use it only for DR purposes and not for any other
load. Or maybe the network connectivity to a secondary replica is not reliable and you do not want to
use that replica for backups. In that case you can exclude the replica. In Figure 11-3, we have given the
primary replica AlwaysOnN1 a backup priority of 20, secondary replica AlwaysOnN2 a backup priority
of 30, and AlwaysOnN3 a backup priority of 10. We could have given them backup priorities of 2, 3, and 1
respectively too and it would behave similarly. With these priorities, it means that our preferred backup
replica is AlwaysOnN2 as it has the highest backup priority followed by AlwaysOnN1 and AlwaysOnN3 has
the lowest priority.
■ Note Even though AlwaysOnN1 (primary) has a higher priority than AlwaysOnN3 (secondary), if
AlwaysOnN2 (secondary) is unavailable then AlwaysOnN3 will become the preferred backup replica as the
backup preference is set to Prefer Secondary.
Using T-SQL
To configure the backup preference and priority using T-SQL, set the AUTOMATED_BACKUP_PREFERENCE option
and the BACKUP_PRIORITY option, respectively, of the CREATE AVAILABILITY GROUP or ALTER AVAILABILITY
GROUP command. Below is a sample code to configure the backup preference to prefer secondary.
USE [master]
GO
ALTER AVAILABILITY GROUP [AGCorp] SET(AUTOMATED_BACKUP_PREFERENCE = SECONDARY);
GO
The following is a sample code to configure backup priority of 20 for AlwaysOnN1 replica, 30 for
AlwaysOnN2 replica and 10 for AlwaysOnN3 replica:
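-- A sketch using ALTER AVAILABILITY GROUP ... MODIFY REPLICA
USE [master]
GO
ALTER AVAILABILITY GROUP [AGCorp] MODIFY REPLICA ON N'AlwaysOnN1' WITH (BACKUP_PRIORITY = 20);
ALTER AVAILABILITY GROUP [AGCorp] MODIFY REPLICA ON N'AlwaysOnN2' WITH (BACKUP_PRIORITY = 30);
ALTER AVAILABILITY GROUP [AGCorp] MODIFY REPLICA ON N'AlwaysOnN3' WITH (BACKUP_PRIORITY = 10);
GO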
Using PowerShell
To configure the backup preference and priority using PowerShell, use the AutomatedBackupPreference
option of the Set-SqlAvailabilityGroup cmdlet and the BackupPriority option of the Set-SqlAvailabilityReplica
cmdlet. Below is a sample code to configure the backup preference to prefer secondary.
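# A sketch; the provider path is an assumption based on our setup
Set-SqlAvailabilityGroup -AutomatedBackupPreference Secondary `
    -Path "SQLSERVER:\Sql\AlwaysOnN1\DEFAULT\AvailabilityGroups\AGCorp"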
The following is a sample code to configure backup priority of 20 for AlwaysOnN1 replica, 30 for
AlwaysOnN2 replica, and 10 for AlwaysOnN3 replica.
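# A sketch; the provider paths are assumptions based on our setup
Set-SqlAvailabilityReplica -BackupPriority 20 `
    -Path "SQLSERVER:\Sql\AlwaysOnN1\DEFAULT\AvailabilityGroups\AGCorp\AvailabilityReplicas\AlwaysOnN1"
Set-SqlAvailabilityReplica -BackupPriority 30 `
    -Path "SQLSERVER:\Sql\AlwaysOnN1\DEFAULT\AvailabilityGroups\AGCorp\AvailabilityReplicas\AlwaysOnN2"
Set-SqlAvailabilityReplica -BackupPriority 10 `
    -Path "SQLSERVER:\Sql\AlwaysOnN1\DEFAULT\AvailabilityGroups\AGCorp\AvailabilityReplicas\AlwaysOnN3"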
If you use the default SQL Server maintenance plan for your backups, then you will not need to do
anything extra, as the maintenance plan uses the sys.fn_hadr_backup_is_preferred_replica function by
default to decide whether the backups need to occur on a particular replica. If you do not use maintenance
plans and have custom backup jobs, then you will need to modify the backup jobs using something similar
to the sample code shown below.
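A minimal sketch of such a job step (the database name and backup path are assumptions):

IF (sys.fn_hadr_backup_is_preferred_replica(N'AdventureWorks2016') = 1)
BEGIN
    BACKUP LOG [AdventureWorks2016]
        TO DISK = N'\\BackupShare\AdventureWorks2016.trn';
END
-- If this replica is not the current preferred backup replica, the job does nothing and exits.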
To ensure that the backups will occur on the current preferred backup replica at the time of starting the
backup job, you need to create the same jobs (using maintenance plan for backups or custom backup jobs)
with the same schedule on all replicas. Say, for example, you configure the backup preference and priority
as shown in Figure 11-3 and you want to take transaction log backups every 15 minutes as per these backup
configuration settings. You can create the automated transaction log backup job very easily by creating a
database maintenance plan wizard on each replica to take log backups every 15 minutes with the same
schedule. The jobs will run on each replica at the same time but the actual backup will occur only on the
replica that is the current preferred backup replica. The jobs on non-preferred replicas will do nothing and
exit. The same logic applies for all supported backup types.
If you are creating a database maintenance plan to take full backups, you will see the warning message
shown in Figure 11-4.
Figure 11-4. Warning message while creating database maintenance plan for full backup
This is expected as regular full backups are not supported on a secondary replica. To clear the warning,
click the Options tab and select Copy-only backup check box as shown in Figure 11-5.
Also, notice the check box labelled For availability databases, ignore replica priority for backup and
backup on primary settings. This option allows us to override the backup preference settings if we need to.
To see how the database maintenance plan uses the system function sys.fn_hadr_backup_is_preferred_
replica, right-click the maintenance plan, and select Modify. In the Design Maintenance Plan
page, right-click Backup Database Task (Full), select Edit, and click the View T-SQL button in the Back Up
Database Task page. Notice the function used in the backup script, as shown in Figure 11-6.
Figure 11-6. Ensuring that the maintenance plan uses the system function sys.fn_hadr_backup_is_preferred_replica
Best Practices
Having a good backup strategy in place helps you restore the databases efficiently after any failure.
As you might be aware, in order to restore a database from a set of backups, you need to create a
recovery sequence for a given point in time. This can be complicated, especially since you have different
types of backups. Also, since you can now take backups on any replica, you can have a bigger problem when
you need to restore all the backups in the same order they were taken. If you miss a backup, you will break
the backup chain.
■ Note MSDB database is only aware of the backups taken on the local SQL Server instance. It does not
know about the backups that are taken on the other replicas.
To ensure that you do not have to search all the replicas for all the backup files during recovery, it is
highly recommended to maintain a central location for all backups, which is accessible from all replicas.
This will ensure that during a failure, anyone can find all the backup files. Also, if the centralized location
is separate from the replicas then you can ensure that you will not lose the backup files if any of the
replicas fail.
Also, you can use the SSMS built-in tool Recovery Advisor to quickly create a correct and optimal restore
sequence.
■ Tip To open the Recovery Advisor tool in SSMS, right-click the Databases node and select Restore Database….
Click the ellipses next to the Device dialog box and add the path(s) to the backup locations. To perform point-in-
time restore, click Timeline button on the Restore Database dialog box.
All you need to provide is the location of the backup sets and the Recovery Advisor tool automatically
comes up with the correct restore sequence based on the header and LSN information available in the
backup files. Again, if you have a centralized backup location, then you will only need to provide the path to
the centralized backup location. But if for some reason, there are multiple backup locations, the Recovery
Advisor tool allows us to add the paths to all the backup locations. Of course, if you forget about one or more
backup paths then the tool will not be able to create the correct restore sequence and may throw an error.
This tool can also be used to perform point-in-time restores easily. You can select a point in time to restore by simply moving the slider on the timeline or by entering the date and time values in the text boxes. The tool automatically chooses the right backup files required to restore the database to that point.
■ Note To restore a database that is a part of an availability group, you will first need to remove it from the
availability group.
We mentioned earlier that availability groups do not support taking differential backups on the
secondary replicas. Also, copy-only full backups that are supported on the secondary replicas cannot be
used as a base for differential backups. So, how do we handle this if our backup strategy requires differential
backups? If you need to take differential backups, then you can set the backup preference to be Primary so
that the full and differential backups will always be taken from the current primary replica. But this also will
mean that the transaction log backups will take place on the current primary replica.
We know that transaction log backup jobs typically run every 15 to 30 minutes, whereas the full and differential backup jobs run once a day after business hours. So, the transaction log backups that are running during business hours will have the maximum impact on the production load. What if you want to offload the transaction log backups to the secondary replica and only take the full and differential backups on the primary replica? To achieve this, one method is to set the backup preference to Prefer Secondary as you saw in Figure 11-3 to ensure that the transaction log backups created using maintenance plans execute on secondary replicas. For the full and differential backups, you can create custom jobs that identify the primary replica before executing the backup. You can use the built-in system function sys.fn_hadr_is_primary_replica for this purpose. As per our discussion about this function in chapter 9, it takes a database name as a parameter and returns 1 when the database on the current instance is the primary replica. If not, the function returns 0. Below is some sample code that you can use to perform full and/or differential backups on the primary replica.
DECLARE @dbname sysname = N'AdventureWorks2016';  -- database to back up

IF sys.fn_hadr_is_primary_replica(@dbname) = 1
BEGIN
    -- This is the primary replica. Perform the backup job.
END
ELSE
    PRINT 'This is not the primary replica, exit.';
As you might be aware, DBCC CHECKDB creates an internal snapshot and performs copy-on-write activity in order to check a specific point in time while allowing redo to progress. While performing DBCC CHECKDB on the secondary replica, it is possible that the secondary replica is behind the primary replica. If you want to avoid creating an internal snapshot, you can execute DBCC CHECKDB WITH TABLOCK. This operation requires a database lock; to obtain the database lock, you need to first suspend the HADR activity on the database, run DBCC CHECKDB WITH TABLOCK, and then resume the HADR activity.
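For example, the sequence could look like the following sketch, assuming a database named AdventureWorks2016 (run on the secondary replica that you want to check):

-- Suspend data movement so that the database lock can be acquired
ALTER DATABASE AdventureWorks2016 SET HADR SUSPEND;
-- Run the integrity check without creating an internal snapshot
DBCC CHECKDB (AdventureWorks2016) WITH TABLOCK;
-- Resume data movement so the secondary can catch up again
ALTER DATABASE AdventureWorks2016 SET HADR RESUME;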
■ Note Suspending the HADR activity on the database will cause a backlog of log blocks and cause the
transaction log file(s) to grow on the primary and other remaining replicas that do not have data synchronization
suspended.
Running integrity checks on the secondary minimizes the impact of the production load on the primary replica, but it does not completely replace the act of running them on the primary replica. This is because the primary cause of the corruption issues that we see is I/O subsystem corruption. As we are aware, each replica has a separate copy of the databases that is stored on a different I/O subsystem (or a different portion of the I/O subsystem) and hosted on different servers with different memory.
■ Note SQL Server 2016 supports running DBCC CHECKDB on databases that contain memory-optimized
tables. However, validation only occurs on the disk-based tables. Also, DBCC repair options are not available
for memory-optimized tables. Therefore, you must continue to back up your databases regularly and test
the backups. You will need to restore from the last good backup if there are any data integrity issues in the
memory-optimized tables. Also, as part of database backup and recovery, a CHECKSUM validation is done for files
in memory-optimized filegroups.
Running DBCC CHECKDB only on the secondary replica is not sufficient. It is possible that one availability
replica may be corrupted while the other replica may not be corrupted, and the only way to detect any
corruption to the database files from the I/O subsystem on a particular replica is via running DBCC CHECKDB
on that replica.
IF sys.fn_hadr_is_primary_replica('AdventureWorks2016') <> 1
BEGIN
    -- This is a secondary replica. Run the integrity check here.
    DBCC CHECKDB (AdventureWorks2016);
END
ELSE
    PRINT 'This is the primary replica, exit!';
You will perform an availability group failover periodically (say, once every other week) so that the next DBCC CHECKDB will run on the new secondary. This way, databases on both primary and secondary replicas are checked for corruption while minimizing the performance impact caused by DBCC CHECKDB. Another benefit of this option is that you will know that all your replicas are functional and operating optimally, and in case of a real disaster you can confidently fail over to any replica and continue to operate normally. If you always use a particular server as your primary replica and use the secondary replica only for DR purposes, then you may not be 100% sure how it will perform when you do a failover.
If for some reason you cannot use option 1, then you can consider using option 2 - run on all replicas but more frequently on the secondary replicas. In this option, run DBCC CHECKDB on all replicas, but run it more frequently (once a week, for example) on the secondary replicas as compared to the primary replica (monthly, for example). This option will not fully compensate for database integrity checks on the primary replica, but it will reduce the impact on the primary replica.
If you cannot use either option 1 or 2, consider using option 3 - run lightweight checks on the primary replica. In this method, you run DBCC CHECKDB WITH PHYSICAL_ONLY to reduce the impact on the primary replica. This option has a small overhead and a shorter run time as compared to running the full DBCC CHECKDB, and it checks the physical consistency of the database and detects torn pages, checksum failures, and common hardware failures that can corrupt user databases.
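For example, assuming the same sample database, the lightweight check is simply:

-- Physical-only check: lower overhead than a full DBCC CHECKDB
DBCC CHECKDB (AdventureWorks2016) WITH PHYSICAL_ONLY;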
■ Note It is still recommended to run the full DBCC CHECKDB periodically to check both the physical and
logical integrity of the user databases.
Summary
In this chapter, you learned the benefits of active secondary replicas and how to leverage the hardware of the secondary replicas to run database backup and integrity check jobs. In the next chapter, we will discuss other common management tasks that DBAs need to perform on an availability group HADR solution.
PART V
CHAPTER 12
In this chapter we will take a look at all the common management tasks that an administrator would need to perform on an availability group. Creating an availability group is a one-time operation; however, you will need to manage the availability group on an ongoing basis.
Suspending an Always On secondary database does not directly affect the availability of the primary database. However, it can impact redundancy and failover capabilities for the primary database. After the secondary database synchronization is suspended, the transaction log of the corresponding primary database cannot be truncated. This causes log records to accumulate on the primary database. Therefore, try to resume or remove a suspended secondary database as quickly as possible. Also consider adding some temporary space for the transaction logs on the primary and other secondaries, as they start growing when the data synchronization is suspended.
■ Note Starting with SQL Server 2014, you can have up to eight secondary replicas for an availability group.
To suspend the synchronization using SQL Server Management Studio, connect to the primary replica,
then under Availability Groups right-click the database and click Suspend Data Movement.
Figure 12-2 shows the Suspend Data Movement option in the SSMS GUI on Primary Replica instance.
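Alternatively, you can suspend and resume data movement with T-SQL. Below is a sketch, assuming a database named AdventureWorks2016 (run on the replica hosting the database):

-- Suspend data movement for the availability database
ALTER DATABASE AdventureWorks2016 SET HADR SUSPEND;
-- Resume data movement when you are ready
ALTER DATABASE AdventureWorks2016 SET HADR RESUME;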
Suspending a primary database stops the data synchronization between the primary and all the secondary databases. In this case, the transaction log of the primary database cannot be truncated, and changes made to the primary database are not propagated to any of the secondary databases. Hence, in case of a disaster on the primary replica instance, there can be data loss. Therefore, try to resume the primary database synchronization as quickly as possible. Also consider adding some temporary space for the transaction logs on the primary and other secondaries, as they start growing when the data synchronization is suspended.
If a secondary database was suspended locally, that is, from the SQL Server instance that hosts the
secondary replica, then that secondary database must be resumed locally.
Once the primary database is in the SYNCHRONIZING state, we know that the synchronization has
resumed on the secondary database.
To change the Availability Mode of a replica using SQL Server Management Studio, connect to the
primary replica and then under Availability Groups, right-click the availability group and click on Properties.
Use the Availability Mode drop-down list to change the availability mode of the replica.
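The same change can be scripted. Below is a sketch, assuming an availability group named AGCorp and a replica hosted on the instance AlwaysOnN2 (both placeholder names):

-- If the replica currently uses automatic failover, set FAILOVER_MODE = MANUAL first
ALTER AVAILABILITY GROUP [AGCorp]
MODIFY REPLICA ON N'AlwaysOnN2'
WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);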
Once Availability Mode is changed, review the availability group configuration to ensure that the
availability group will be able to meet the RTO and RPO requirements in the SLA.
Figure 12-4 shows the Availability Mode setting in the Availability Group Properties.
Add a Database
A database can be added to an availability group by adding it first to the primary replica instance and then restoring and adding it to the secondary replica instances. A database can also be added to a secondary replica instance if it is already present on the primary replica instance. This is possible if the availability group was created with the Skip Initial Synchronization option or if the secondary database was removed from the availability group due to a performance bottleneck.
If the database is added to the primary replica instance, then add it to all the secondary replica instances at the earliest chance. If automatic seeding is enabled for an availability group, SQL Server automatically creates the secondary databases when you add a database to the availability group. That means you no longer have to manually back up and restore secondary databases. For automatic seeding to work, the data and log file paths should be the same on every SQL Server instance participating in the availability group. Also, the database should be in the full recovery model and should have a current full and transaction log backup.
To add a database to an availability group using SQL Server Management Studio, connect to the
primary replica instance and under Availability Groups right-click Availability Group and click Add Database
to launch the Add Database to the Availability Group wizard.
Figure 12-5 shows the Add Database option in the SSMS GUI.
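On the primary replica instance, the equivalent T-SQL is a one-liner; AGCorp and AdventureWorks2016 are placeholder names:

ALTER AVAILABILITY GROUP [AGCorp] ADD DATABASE [AdventureWorks2016];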
To add a database to a secondary replica of an availability group using SQL Server Management Studio, first ensure that the database is prepared for the availability group. Then connect to the secondary replica instance and under Availability Groups, right-click the availability group and click Join to Availability Group to launch the Join Database to Availability Group wizard.
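On the secondary replica instance, after the database has been restored WITH NORECOVERY, the equivalent T-SQL is (placeholder names again):

ALTER DATABASE [AdventureWorks2016] SET HADR AVAILABILITY GROUP = [AGCorp];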
Remove a Database
The transaction log can start growing when database synchronization is suspended. If the synchronization
cannot be resumed, then the database should be removed from the secondary replica. The removed
secondary database goes into the RESTORING state.
Figure 12-6 shows the Remove Database option in the SSMS GUI.
When a database is removed from the primary replica, the database is removed from the availability group on all the replicas. It remains online on the primary replica, and on the secondary replicas it goes into the RESTORING state.
To remove a database from a secondary replica of an availability group using SQL Server Management
Studio, connect to the secondary replica and under the Availability Groups, right-click the database and click
Remove Secondary Database to launch the Remove Database from Availability Group wizard.
To remove a database from an availability group using SQL Server Management Studio, connect to the
primary replica instance and then under Availability Groups, right-click the database to launch the Remove
Database from Availability Group wizard.
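Both removals can also be scripted; a sketch with placeholder names:

-- On the primary replica: remove the database from the availability group
ALTER AVAILABILITY GROUP [AGCorp] REMOVE DATABASE [AdventureWorks2016];
-- On a secondary replica: remove the local secondary database
ALTER DATABASE [AdventureWorks2016] SET HADR OFF;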
If the database was removed due to performance considerations, then plan to add it back to the
availability group at the earliest chance.
Add a Replica
A new replica can be added to an existing availability group.
Before you add an instance to an availability group as a secondary replica, first ensure that the instance
satisfies all the prerequisites discussed in the prerequisites chapter.
To add a replica to an availability group using SQL Server Management Studio, connect to the primary
replica and then under Availability Groups, right-click the Availability Group and click Add Replica to launch
the Add Replica to the Availability Group wizard.
Once the replica is added, review the replica properties such as Availability Mode, Backup
Preference, etc.
Figure 12-7 shows the Add Replica option in the SSMS GUI.
Remove a Replica
A replica can be removed from an existing availability group. If an availability group has multiple databases and a secondary replica is causing performance issues on the primary replica, it might be better to remove the secondary replica than to remove selected replica databases. The most common reason to remove a secondary replica is that it is either disconnected or down and you no longer want it to be part of the availability group.
To remove a replica from an availability group using SQL Server Management Studio, connect to the
primary replica instance and under the Availability Groups, right-click the replica that you want to remove
and click Remove from Availability Group to launch the Remove Secondary Replica from the Availability
Group wizard.
Figure 12-8 shows the Remove Replica option in the SSMS GUI.
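The equivalent T-SQL, run on the primary replica instance, looks like the following sketch (AGCorp and AlwaysOnN4 are placeholder names):

ALTER AVAILABILITY GROUP [AGCorp] REMOVE REPLICA ON N'AlwaysOnN4';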
Once the replica is removed, review the availability group properties such as Availability Mode, Backup Preference, etc., to ensure that you will still be able to achieve RTO and RPO objectives with the change in the configuration.
Deleting an availability group also deletes the listener. However, the Virtual Computer Object (VCO) for
the listener still remains in the Active Directory. Clean up the VCO from the AD OU post availability group
deletion.
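Deleting an availability group can also be scripted; AGCorp is a placeholder name:

DROP AVAILABILITY GROUP [AGCorp];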
Table 12-1. Cluster properties and their values to tune the cluster heartbeats
The existing heartbeat settings can be viewed by running the following PowerShell command:
PS C:\Windows\system32> get-cluster | fl *subnet*
To change the setting, the following PowerShell command can be used:
PS C:\Windows\system32> (get-cluster).CrossSubnetThreshold = 20
■ Note You cannot create multiple listener names using the SSMS GUI.
Summary
In this chapter we covered some of the common management tasks that you might have to perform on availability groups. In the next chapter we will take a look at how to upgrade, update, and migrate availability groups.
CHAPTER 13
The Microsoft SQL Server team announced the general availability of SQL Server 2016 on June 1, 2016. Your company's mission-critical production application currently uses SQL Server 2014 Always On Availability Groups. Your business unit wants you to start testing the upgrade to SQL Server 2016 so that the production availability group (AG) environment can be upgraded in six months. Everything needs to be tested, and one of the critical requirements for the upgrade is that the downtime for the production client applications needs to be minimal, as it is a 24/7 application. You have been asked to keep the actual downtime that the client applications will experience during the upgrade to only a few minutes (say, less than 5 minutes). You might be wondering whether this is wishful thinking and if it is even possible to perform an upgrade from SQL Server 2014 to SQL Server 2016 with less than 5 minutes of downtime. From past experience, you know that even for a regular stand-alone SQL Server instance upgrade, the typical downtime is around 1 to 4 hours. And this is an Always On Availability Group environment with multiple SQL Server instances running on a Windows Server Failover Cluster (WSFC). How are you going to achieve this task?
■ Note The rolling upgrade and update processes are exactly the same. Even though the following sections discuss rolling upgrade, you can simply replace the word upgrade with update and use the same process to apply an update to availability group replicas.
Prerequisites
Even though a rolling upgrade lets us minimize the actual downtime for the client applications, it does not mean the actual process of upgrading takes only a few minutes. The upgrade itself takes much longer, but the business is happy as long as the client applications are not experiencing any outage and can perform their work normally. We cannot stress enough that it is very important to plan and test the upgrade thoroughly in a test environment to achieve minimal downtime and risk. Only after thorough testing is completed and all your time requirements are met can you perform the upgrade on the production environment.
Here are some important prerequisites that should be completed before the upgrade to SQL Server 2016:
• Review the hardware and software requirements for installing SQL Server 2016.
• Review your current SQL Server and Windows Server versions and editions. Do they have a supported upgrade path to SQL Server 2016? If you are on SQL Server 2012 and want to upgrade to SQL Server 2016, then you will need a minimum of SP1 for SQL Server 2012 installed. You can upgrade from SQL Server 2014 release to manufacturing (RTM) and later to SQL Server 2016. For more information, review the Microsoft MSDN article titled "Supported version and edition upgrades" at https://msdn.microsoft.com/en-us/library/ms143393.aspx.
■ Note When you upgrade availability group replicas from a prior version of SQL Server Enterprise Edition to SQL Server 2016, you will need to choose between Enterprise Edition: Core-based Licensing and Enterprise Edition. Enterprise Edition: Core-based Licensing supports all the cores as reported by the operating system, whereas Enterprise Edition (not available for new agreements) is limited to a maximum of 20 cores per SQL Server instance.
• Review the SQL Server 2016 backward compatibility topic to review the changes in
behavior between SQL Server 2016 and the SQL Server version you are upgrading from.
• Run the SQL Server 2016 Data Migration Assistant tool to see if there are any upgrade
blockers, and if any modifications have to be made to scripts or applications due to
breaking changes. You can download the SQL Server 2016 Data Migration Assistant
tool from http://www.microsoft.com/en-us/download/details.aspx?id=53595.
SQL Server Upgrade Advisor tool is deprecated and replaced with Data Migration
Assistant tool.
• Run the SQL Server 2016 System Configuration Checker tool to see if the SQL Server
setup programs detect any blocking issues before actually upgrading. To run the
System Configuration Checker tool, run SQL Server 2016 setup.exe, click Tools on the
left-hand side, and then click System Configuration Checker on the right-hand side.
• Plan and thoroughly test the upgrade, and document all the steps with detailed information.
■ Note Failure to fail over to an upgraded secondary replica before upgrading the primary replica will affect
the applications and users and will extend the downtime during the primary replica upgrade.
• If the replicas are SQL Server failover clustering instances (FCIs), then upgrade the
inactive node of the FCI before you upgrade the active node.
Figure 13-1. Rolling Upgrade Process for an AG environment used for HA and DR
The steps for rolling upgrade for our availability group environment are as follows:
1. Remove automatic failover from all synchronous-commit replicas (AlwaysOnN1 and AlwaysOnN2). This is to prevent any automatic failovers during the upgrade process. (This step, the manual failover in step 4, and step 6 can all be scripted; see the T-SQL sketch after these steps.)
2. Upgrade remote secondary replica (AlwaysOnN3).
3. Upgrade local secondary replica (AlwaysOnN2). At this point, we have upgraded
all secondary replicas and the client applications connected to the primary
replica (AlwaysOnN1) have not experienced any downtime.
■ Note During the upgrade process, the secondary replica is not available for failover or for read-only
reporting workloads. But the production OLTP workload on the primary replica is not affected. After the upgrade
of the secondary replicas, depending upon the activity on the primary replica, it may take some time for the
secondary replica to catch up with the primary replica.
4. Ensure that the local secondary replica (AlwaysOnN2) has caught up with the primary replica and is in a SYNCHRONIZED state. Whenever you have some downtime, manually fail over the availability group to the SYNCHRONIZED secondary replica. After the availability group fails over to AlwaysOnN2, client applications can get connected again and work normally. The only downtime they will experience is the time it takes the availability group to fail over from AlwaysOnN1 to AlwaysOnN2, which is typically a few seconds. To find the exact downtime, it is important to test in your test environment.
5. Upgrade the former primary replica (AlwaysOnN1).
6. Configure the automatic failover mode for the replicas (AlwaysOnN1 and
AlwaysOnN2).
■ Note If you need AlwaysOnN1 to be the primary replica, wait for it to catch up with AlwaysOnN2 and achieve the SYNCHRONIZED state. Then manually fail over the availability group from AlwaysOnN2 to AlwaysOnN1. This will cause some downtime (a few seconds), and the total downtime will be equal to two manual failovers.
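The failover-mode changes in steps 1 and 6 and the manual failover in step 4 can be scripted. Below is a sketch, assuming an availability group named AGCorp (a placeholder name):

-- Step 1, on the primary replica: switch a replica to manual failover
ALTER AVAILABILITY GROUP [AGCorp]
MODIFY REPLICA ON N'AlwaysOnN2' WITH (FAILOVER_MODE = MANUAL);
-- Step 4, on the synchronized secondary (AlwaysOnN2): perform the manual failover
ALTER AVAILABILITY GROUP [AGCorp] FAILOVER;
-- Step 6, on the new primary: restore automatic failover
ALTER AVAILABILITY GROUP [AGCorp]
MODIFY REPLICA ON N'AlwaysOnN1' WITH (FAILOVER_MODE = AUTOMATIC);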
Figure 13-2. Rolling Upgrade Process for an AG environment used for only HA
■ Note If you need AlwaysOnN1 to be the primary replica, wait for it to catch up with AlwaysOnN2 and achieve the SYNCHRONIZED state. Then manually fail over the availability group from AlwaysOnN2 to AlwaysOnN1. This will cause some downtime (a few seconds), and the total downtime will be equal to two manual failovers.
Figure 13-3. Rolling Upgrade Process for an AG environment used for only DR
Figure 13-5. Rolling Upgrade Process for an AG environment with SQL Server FCIs
■ Note In this environment, the rolling upgrade process causes an average downtime of less than two
failovers per availability group.
Figure 13-6. Cluster OS Rolling Upgrade Workflow for each Windows Server 2012 R2 cluster node
■ Note During a Cluster OS Rolling Upgrade, it is recommended to use a clean Windows Server 2016 install, which means you will need to reinstall SQL Server and add the replica to the availability group. An in-place Windows upgrade is not encouraged, although it may work in some cases where default drivers are used. The actual downtime for the availability group primary replica during a Cluster OS Rolling Upgrade is the time taken for one manual failover. Once Windows Server 2016 is released, we recommend you review the Technet article at https://technet.microsoft.com/windows-server-docs/compute/failover-clustering/cluster-operating-system-rolling-upgrade to see if there are any updates to the Cluster OS Rolling Upgrade process.
Cluster OS Rolling Upgrade starts with a Windows Server 2012 R2 cluster and includes the following
high-level steps:
1. Select half of the cluster nodes that have the secondary replicas. Remove the
secondary replicas from the existing availability group.
2. Pause, drain, evict, reformat, and install Windows Server 2016 on these cluster
nodes.
3. On the Windows Server 2016 nodes, add Failover Clustering feature, configure
networking and storage connectivity, and add the node to the existing Windows
Server 2012 R2 cluster.
4. Reinstall SQL Server and add the replica to the availability group. At this time, the availability group can fail over to this cluster node if required. The time required to perform the manual failover is the only downtime that the clients connecting to the primary replica will experience.
5. Perform steps 2, 3, and 4 on the remaining half of the nodes.
6. At this point, all the cluster nodes are upgraded to Windows Server 2016, but the process can still be fully reversed, and Windows Server 2012 R2 nodes can be added to the cluster.
7. The cluster is still running at the Windows Server 2012 R2 functional level.
8. When you are certain that you want to run at the Windows Server 2016 functional level, run the Update-ClusterFunctionalLevel PowerShell cmdlet (see the example after the following note).
■ Note After you update the cluster functional level, you cannot go back to Windows Server 2012 R2
functional level and Windows Server 2012 R2 nodes cannot be added to the cluster.
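A sketch of checking and then updating the functional level with PowerShell, in the same style as the earlier cluster commands:

PS C:\Windows\system32> Get-Cluster | fl ClusterFunctionalLevel
PS C:\Windows\system32> Update-ClusterFunctionalLevel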
Cross-Cluster Migration
Let’s say the operating system on your availability group cluster nodes is Windows Server 2008 R2 and you have been asked to plan an upgrade to Windows Server 2012 R2. You know that an in-place upgrade of the operating system hosting the WSFC nodes is not supported prior to Windows Server 2012 R2. So how are you going to upgrade the operating system? To address this, Microsoft introduced cross-cluster migration starting in SQL Server 2012 SP1. Cross-cluster migration lets us move one or more availability groups from one WSFC (source cluster) to a new, higher-version WSFC (destination cluster) with little downtime.
■ Note Cross-cluster migration of Always On Availability Groups is intended primarily for an operating
system upgrade to a Windows Server 2012 (or R2) cluster.
To support cross-cluster migration, an equal or higher version of SQL Server (a minimum of SQL Server
2012 SP1) must be installed on two or more nodes of the destination cluster, and these SQL Server instances
must be enabled for Always On. Figure 13-7 shows two WSFCs. The first one (source cluster) is a two-node
Windows Server 2008 R2 cluster, each hosting a stand-alone SQL Server instance (SQL Server 2012 SP1 or
higher). Both the instances are part of an availability group AG1 with instance 1 being the primary replica
and instance 2 being the synchronous secondary replica. The second cluster (destination cluster) is a two-
node Windows Server 2012 R2 cluster. The goal is to migrate AG1 from source cluster to destination cluster
with minimum downtime.
Figure 13-7. Upgrading operating system from Windows Server 2008 R2 to 2012 R2 using cross-cluster
migration
Preparation
This phase prepares the destination cluster and involves the following steps:
1. Create a new Windows Server 2012 R2 cluster.
2. Install SQL Server 2012 SP1 or higher on each cluster node.
3. Configure each SQL Server instance to support availability group.
4. On each source cluster node, grant cluster registry permissions to the service
account of each SQL Server instance on the destination cluster node.
5. For each availability group listener that needs to be migrated, grant the
destination cluster full permission on the virtual network name (VNN) computer
object in the Active Directory Server.
Data Migration
This phase involves creating and configuring two new secondary replicas on the destination cluster for each
availability group. Figure 13-8 depicts the data migration phase.
■ Note Switching HADR cluster context is supported starting from SQL Server 2012 SP1.
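A sketch of the two context switches involved, with SourceCluster as a placeholder cluster name:

-- During data migration, on each destination instance: follow the source cluster
ALTER SERVER CONFIGURATION SET HADR CLUSTER CONTEXT = N'SourceCluster';
-- During resource migration, on each destination instance: switch back to the local cluster
ALTER SERVER CONFIGURATION SET HADR CLUSTER CONTEXT = LOCAL;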
Resource Migration
This phase moves the availability group resources from the source cluster to the destination cluster.
Figure 13-9 depicts the resource migration (planned downtime) phase.
■ Note SQL Server 2012 SP1 introduced a new ALTER AVAILABILITY GROUP option for taking an availability
group offline.
3. On each destination SQL Server instance, switch the HADR cluster context to
LOCAL cluster.
4. For each availability group in the batch, create new availability group on the
destination instance.
5. Create new availability group listener.
After migrating your availability groups, remember to perform the post-installation tasks as discussed in
chapter 9. For example, create the SQL Agent jobs on the destination SQL Server instances for the databases
that were migrated.
■ Note Starting with SQL Server 2016, distributed availability groups (DAGs) as discussed in chapters 3 and
8 are another option to perform rolling upgrades of the operating system. After upgrading the operating system
on the second WSFC, wait until the second availability group catches up. Then perform a manual failover to the
second availability group and drop the distributed availability group using the DROP AVAILABILITY GROUP T-SQL
command. As distributed availability groups associate two availability groups that have their own listeners, you could
use the old database mirroring connection string syntax for transparent client redirection after failover. For example:
"Server=AGCorp1_listen; Failover_Partner=AGCorp2_listen; Database=AdventureWorks2016"
Summary
In this chapter, we discussed the Always On Availability Group rolling upgrade and rolling update features. These features enable us to upgrade the SQL Server version and apply SQL Server and Windows Server updates like security patches, cumulative updates, and service packs with minimal downtime and minimal risk. We also discussed the steps to upgrade the Windows operating system from Windows Server 2012 R2 to Windows Server 2016 using the Cluster OS Rolling Upgrade feature introduced in Windows Server 2016. Last, we discussed the cross-cluster migration feature that was introduced in SQL Server 2012 SP1 to upgrade the operating system from Windows Server 2008 R2 to Windows Server 2012 R2. In the next chapter, we will review the common database maintenance tasks that an administrator would need to perform on an availability group.
CHAPTER 14
In this chapter we will take a look at some of the considerations for the database maintenance tasks that an
administrator would need to perform on an availability group.
Since the secondary replica databases are read-only, index maintenance and statistics updates need to
be performed on the primary replica SQL Server instance.
Index Maintenance
As index maintenance is performed on the primary, it not only impacts the primary replica, but it can also impact the data synchronization between replicas. Index maintenance operations can involve massive amounts of data, and this data change also needs to be synchronized with the secondary replicas. This impacts the performance of data synchronization, causing data transfer delays in the case of asynchronous commit and a performance impact on the primary in the case of synchronous commit.
Index rebuilds in particular create additional synchronization overhead for databases in an availability group. Take this into consideration while deciding on rebuild vs. reorganize. It is generally recommended to rebuild if the fragmentation is more than 30% and reorganize if it is between 10% and 30%. Defragmenting small tables may not provide any performance benefit; hence it is better to check for fragmentation only in tables with a page count greater than 1000 pages.
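A sketch for finding the indexes worth maintaining under these thresholds (run on the primary replica; the LIMITED scan mode and the threshold values are the assumptions discussed above):

SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name AS index_name,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
    ON i.object_id = ips.object_id AND i.index_id = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 10
  AND ips.page_count > 1000
ORDER BY ips.avg_fragmentation_in_percent DESC;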
Switch Availability Mode
Significant amounts of data get written to the transaction log on the primary replica instance while rebuilding indexes, which the primary then sends to the secondary replica instances. Hardening of transaction logs on a secondary replica instance can be slowed down due to this additional workload. In the case of a synchronous-commit replica, this can result in a higher than normal HADR_SYNC_COMMIT wait on the primary. To avoid this situation, consider switching the availability mode of all the secondary replicas to asynchronous commit during the maintenance period and then switching back to the original mode after the maintenance is completed.
Statistics Updates
Although temporary statistics are created and maintained on the read-only replicas, there are still some statistics maintenance tasks that need to be performed on the primary replica. The permanent statistics on the primary replica are replicated to the secondary replicas.
■ Note Missing statistics are automatically created on both the primary and secondary replicas.
Memory-Optimized Tables
Memory-optimized tables are supported in availability groups. However, automatic statistics update for memory-optimized tables on the primary or secondary replica is not supported prior to SQL Server 2016. Hence, monitor the query performance and execution plans on the secondary replica and manually update the statistics on the primary replica when needed.
■ Note Starting with SQL Server 2016, automatic update of statistics is supported for memory-optimized tables when using database compatibility level 130.
Summary
In this chapter we covered some of the database maintenance tasks that you might have to perform on availability groups. In the next chapter we will take a look at how to monitor availability groups.
PART VI
CHAPTER 15
Now that you have successfully deployed Always On Availability Groups and configured them per Microsoft best practices, it's time to discuss monitoring the health of the availability groups. Monitoring availability groups is an important activity that allows us to review the availability groups quickly to ensure that everything is working as expected. Also, the data generated by the monitoring tools is invaluable when troubleshooting a problem and finding its root cause. In this chapter, we will explore different ways to monitor the current health and synchronization status of availability groups.
Using Dashboard
We discussed the Always On Availability Groups dashboard in chapter 9. As a refresher, the dashboard is a graphical user interface (GUI) that, similar to your automobile's dashboard, organizes and presents the information about availability groups in a way that is easy to interpret. The dashboard displays the health and details of the availability groups, availability replicas, and availability databases, which allows database administrators (DBAs) to make quick operational decisions. It shows the synchronization states and provides performance indicators that allow us to monitor the availability groups without writing a single line of code. Figure 15-1 shows a sample Always On dashboard when launched from the primary replica.
Figure 15-1. Sample availability group dashboard launched from the primary replica
The dashboard displays a lot of information and is also highly configurable. There are many hidden
columns that can be added for additional information. Below we will discuss some of the useful columns:
• Failover Mode lets us know the failover mode for which the replica is configured. The
possible values are Automatic or Manual.
• Availability Mode lets us know the availability mode for which the replica is
configured. The possible values are synchronous or asynchronous.
• Synchronization State column is in both the availability replica health section and
the availability database health section. In the availability replica health section, it
tells us whether the secondary replica is currently synchronized with the primary
replica. The typical values are synchronized, synchronizing, or not synchronized.
If the local SQL Server instance cannot communicate with WSFC then the value is
NULL. In the availability database health section, it tells us whether the availability
database is currently synchronized with the primary replica. The possible values
here are synchronized, synchronizing, not synchronizing, reverting, or initializing.
■ Note Do not force failover to the secondary replica when a database is in the reverting or initializing state
as that may leave the database in a state in which it cannot be started.
• Issues tells us if there is any issue. The possible values are Warnings or Critical. To
review the issue details, click the hyperlink for the Warning or Critical.
• Connection State indicates whether the secondary replica is currently connected to
the primary replica. The possible values are connected and disconnected.
• Last Connection Error No., Description and Timestamp columns provide us with the
timestamp of the last connection error along with the number and brief description.
• Failover Readiness tells us to which replica the availability group can be failed over with or without potential data loss.
• Estimated Data Loss (seconds) is a very useful column to monitor recovery point
objective (RPO). It tells us the time in seconds that the data is not synchronized to
the secondary database. If the primary replica fails, all data within this time window
will be lost.
• Estimated Recovery Time (seconds) is a very useful column to monitor the recovery
time objective (RTO). It tells us the time in seconds it will take for the secondary
replica to catch up with the primary replica.
• Log Send Queue Size (KB) indicates the amount of log records in KB that needs to be sent from the primary database to the secondary database to complete synchronization.
• Log Send Rate (KB/sec) indicates the rate in KB per second at which the log records are being sent from the primary replica to the secondary replica.
• Redo Queue Size (KB) indicates the amount of log records in KB that needs to be redone in the secondary database to complete synchronization.
• Redo Rate (KB/sec) indicates the rate in KB per second at which the log records are being redone in the secondary database to complete synchronization.
Figure 15-2 shows the dashboard with some of the useful hidden columns displayed.
Using Transact-SQL
SQL Server 2016 availability groups have a rich set of catalog views, dynamic management views (DMVs), and functions (DMFs) that allow us to monitor the availability groups, availability replicas, and availability databases.
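For example, a minimal sketch of a health query along these lines joins the availability groups catalog view to the replica states DMV (the column choice here is an illustration, not the full script shown in Figure 15-3):

SELECT ag.name AS availability_group,
       ars.role_desc,
       ars.operational_state_desc,
       ars.synchronization_health_desc
FROM sys.availability_groups AS ag
JOIN sys.dm_hadr_availability_replica_states AS ars
    ON ag.group_id = ars.group_id;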
Figure 15-3. Monitoring the health of the availability groups using T-SQL
The next T-SQL script can be used to monitor the current workload and synchronization status of the
availability group:
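A minimal sketch along those lines uses the database replica states DMV joined to the availability replicas catalog view (the full script appears in Figure 15-4):

SELECT ar.replica_server_name,
       drs.synchronization_state_desc,
       drs.log_send_queue_size,
       drs.log_send_rate,
       drs.redo_queue_size,
       drs.redo_rate
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = drs.replica_id;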
Figure 15-4. Monitoring current workload and synchronization status using T-SQL
The next T-SQL script can be used to monitor the automatic seeding progress and performance using
the two new DMVs introduced in SQL Server 2016:
SELECT start_time,
completion_time,
is_source,
current_state,
failure_state,
failure_state_desc
FROM sys.dm_hadr_automatic_seeding;
■ Note A lot of the Always On wait types are expected waits when everything is running normally. Just
because a wait type appears on the top of the list, it does not mean that there is some issue. For a description
of all the wait types, refer to the MSDN article https://msdn.microsoft.com/en-us/library/ms179984.
aspx. It is recommended to have a baseline of the wait types when everything is working normally so that the
values can be compared with the wait types when there is a performance issue.
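One simple way to capture such a baseline is to query sys.dm_os_wait_stats; a minimal sketch that filters to the Always On wait types:

SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE N'HADR%'
ORDER BY wait_time_ms DESC;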
Let’s review some of the Always On wait types that are commonly seen when you are monitoring the
wait types:
• HADR_SYNC_COMMIT is expected for synchronized availability groups. This wait type tracks the time it takes the primary to send the transaction log blocks to the secondary, and for the secondary to receive them, harden them, and send the commit acknowledgement back to the primary. A consistently high value for this wait type could mean that the synchronous secondary has a performance bottleneck that is affecting the commits on the primary replica instance. This wait type is also reflected by the Transaction Delay performance monitor counter.
• HADR_CLUSAPI_CALL is very frequently seen when reviewing the wait statistics in an availability group environment. There is nothing to worry about even if you see this at the top of your wait statistics list. All this wait type tells us is that a SQL Server thread is waiting to switch from non-preemptive mode (scheduled by SQL Server) to preemptive mode (scheduled by Windows Server) to invoke the WSFC APIs. Since availability groups work very closely with the WSFC, there are many cluster APIs and activities being executed, and it is very natural to see this one at the top of the list.
• HADR_LOGCAPTURE_WAIT indicates the time SQL Server is waiting for log records to become available. If the hardening is completely caught up and there are no log blocks waiting to be hardened to the transaction log on disk, SQL Server will wait to get the next log block. So if the log scan is completely caught up, you will actually see a high value for this wait type. This is expected and does not mean that something is necessarily wrong.
• HADR_SYNCHRONIZING_THROTTLE indicates the time it takes a synchronizing
secondary database to catch up with the primary database in order to transition
from synchronizing state to synchronized state. This is an expected wait when a
secondary database is trying to catch up with the primary database. If you are having
latency issues and you see this wait type on the top, you may consider switching
the secondary replica to asynchronous commit and later during off-peak hours
when the estimated data loss on that secondary nears zero, you can switch back to
synchronous-commit mode and it will quickly change its status to synchronized.
• WRITELOG is not an Always On wait type but you see this quite often in an availability
group environment. This wait type indicates the time it takes for SQL Server to
harden the log blocks to the transaction log. If you see a high value for this wait type,
it typically means that you may have a disk subsystem bottleneck. It means that
SQL Server is sending writes to the transaction log but the writes are not happening
quickly enough. This is when you want to review the disk subsystem, review disk
performance monitor counters, and work with storage administrators as a high value
of this wait type can have a direct impact on the availability group synchronization,
which in turn will affect your service level agreements (SLAs).
■ Note All the availability replica performance counters can be traced on both the primary and the secondary
replicas. Some counters may show a zero value depending on the current role of the local replica.
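Besides Performance Monitor, the same counters can be read with T-SQL through the sys.dm_os_performance_counters DMV; a minimal sketch:

SELECT object_name, counter_name, instance_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE N'%Availability Replica%';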
• Bytes Received from Replica/sec indicates the number of bytes received from the
availability replica per second.
• Bytes Sent to Replica/sec indicates the number of bytes sent per second to the remote
replica. On a primary replica, this counter indicates the number of bytes sent per
second to the secondary replica. On a secondary replica, this counter indicates the
number of bytes sent per second to the primary replica.
■ Note Bytes Sent to Replica/sec on the primary replica reflects the same values as Bytes Received from
Replica/sec on the secondary replica.
• Bytes Sent to Transport/sec indicates the actual number of bytes sent per second to
the remote replica over the network. This counter closely follows the same pattern
as Bytes Sent to Replica/sec but it averages higher as the transport bytes also contain
control block overhead.
• Resent Messages/sec indicates the number of Always On messages resent in the last second. Basically, it means that SQL Server has sent a message, but something happened (maybe corruption) and SQL Server has to send the message again. A non-zero value for this performance counter should be further investigated to see why messages are being sent over and over again. Trace this counter at the primary and secondary replicas.
The preceding counters are specific to the availability replica. The next set of counters are specific to
every availability database. These counters are part of the SQL Server: Database Replica object.
• Log Send Queue indicates how much log (in KB) has not been sent from the primary database to the secondary database. This counter tells us the amount of log the secondary database will not have at the time of failover and hence the amount of data loss customers will experience. A high value of the log send queue counter may directly mean that you will not be able to achieve your RPO. You may be looking at data loss if you have a high log send queue value and the only available replicas are asynchronous replicas. The log send queue is also reported in the log_send_queue_size column of the sys.dm_hadr_database_replica_states DMV. Trace this counter at the primary replica.
■ Note As Log Send Queue counter rises, the Log Bytes Received/sec counter trends down.
• Transaction Delay indicates the time (in milliseconds) it takes the primary to send the transaction log blocks to the secondary, and for the secondary to receive them, harden them, and send the commit acknowledgement back to the primary. This is similar to the HADR_SYNC_COMMIT wait type. Since asynchronous-commit mode does not need an acknowledgement to commit a transaction, this counter reports 0 for databases in asynchronous-commit mode. For multiple secondaries, this counter measures the total time all transactions wait for the secondary acknowledgement. Trace this counter at the primary replica.
• Redo Bytes Remaining indicates the amount of log records that is remaining to be
redone to finish the reverting phase. Trace this counter at the secondary replica.
■ Note If Log Bytes Received/sec trends higher than Redone Bytes/sec for a sustained period of time, then
it indicates that REDO latency is building up between the primary and the secondary replicas. If Redo Bytes
Remaining and Recovery Queue are growing, then it also indicates that REDO is the bottleneck.
• Redo Blocked/sec indicates the number of times the redo thread is blocked per
second on locks held by database readers.
While monitoring the Always On performance counters, you may also want to monitor the Log Bytes
Flushed/sec counter in the SQL Server: Databases object. This is not an Always On performance counter
but is very useful. This counter indicates the total number of log bytes flushed per second. It tells us at what
rate you are writing to the transaction log. If the Log Bytes Flushed/sec is very low, then it may indicate some
disk subsystem bottleneck. Review other disk performance counters to review the disk subsystem and
collaborate with your storage administrators as required.
■ Note You may have noticed that we haven’t provided any threshold values for the performance counters
because none exist. There are no magic values for these counters. Hence, it is very important to have a baseline
by monitoring these performance counters when everything is working fine. That way when things are not
working fine, you can collect the performance counters and compare the values with the baseline.
As a refresher, data synchronization in synchronous-commit mode is shown in Figure 15-5 and the
useful metrics that can be collected during the data synchronization are discussed next:
■ Note If the primary replica were to fail at this time and only an asynchronous secondary replica is available, then you may lose data. You can divide log_send_queue_size (KB) by log_send_rate (KB/sec) to calculate the potential data loss and see if you are meeting your RPO.
In the case of multiple secondary replicas, there is one Log Capture thread for each replica, which ensures that the log blocks are sent across multiple replicas in parallel. The log content gets compressed and encrypted before being sent over to the secondary replicas. The actual number of bytes that are sent can be tracked by the performance monitor counter Bytes Sent to Transport/sec in the SQL Server: Availability Replica object.
5. On the secondary replica, Log Receive thread receives the log blocks from the
network. The number of bytes received from the replica per second can be
tracked by capturing the performance monitor counter Bytes Received from
Replica/sec in the SQL Server: Availability Replica object.
6. It writes to the log cache on the secondary replica.
7. When the log block on the secondary replica becomes full, or it receives a commit log record, it hardens the content of the log cache onto the log disk on the secondary replica. Capture the performance monitor counter Log Bytes Flushed/sec in the SQL Server: Databases object to track the rate at which log hardening is occurring on the secondary.
8. If the secondary replica is configured to run in synchronous mode, it will send an acknowledgement of the commit to the primary node indicating that it has hardened the transaction, so it is safe to tell the user that the transaction is committed. At this point you can monitor the wait type HADR_SYNC_COMMIT. What if the primary replica is having some disk bottlenecks, or is so busy with the workload running on it that it still hasn't hardened the transaction log blocks on the primary replica? The additional wait introduced by this issue will be tracked using the WRITELOG wait type. So if you see high values of the WRITELOG wait type on the primary replica, it means that as far as transaction log hardening is concerned, the primary replica is slower than the synchronous-commit secondary replica. In that case, if you see that commits are delayed or queries are timing out, focus on the primary replica instead of worrying about the secondary replicas.
9. The redo thread runs independently of how log blocks are being generated on the secondary or being copied and persisted. If the redo thread is running a few minutes behind, those log blocks may not be available in the log cache. In that case, the redo thread will pick up those log blocks from the log disk, which is what is shown in the dotted line on the right side of the figure above. You can track the redo thread activity by monitoring the performance monitor counters Redone Bytes/sec and Recovery Queue in the SQL Server: Database Replica object and the redo_queue_size (KB) and redo_rate (KB/sec) columns in the DMV sys.dm_hadr_database_replica_states. To track the RTO, divide redo_queue_size (KB) by redo_rate (KB/sec).
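The RPO and RTO calculations mentioned in the note and in step 9 can be combined into a single query; a minimal sketch:

SELECT DB_NAME(drs.database_id) AS database_name,
       1.0 * drs.log_send_queue_size / NULLIF(drs.log_send_rate, 0) AS est_data_loss_sec,      -- RPO
       1.0 * drs.redo_queue_size / NULLIF(drs.redo_rate, 0) AS est_recovery_time_sec           -- RTO
FROM sys.dm_hadr_database_replica_states AS drs;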
If a system policy evaluates to false, you will see an error/warning in the dashboard. If you click the
hyperlink for the error/warning, you will see the underlying policies that evaluated to false and their
corresponding messages.
■ Note Policy-Based Management (PBM) is a feature that was originally introduced in SQL Server 2008
to manage SQL Server instances. PBM allows us to define and enforce policies for the SQL Server instances.
If you are not familiar with PBM, refer to Microsoft’s MSDN article “Administer Servers by Using Policy-Based
Management” at https://msdn.microsoft.com/en-us/library/bb510667.aspx for more information.
■ Note If an availability group contains more than one availability database, then the availability database
with the highest Tfailover and Tdata_loss becomes the limiting value for RTO and RPO compliance respectively.
Next we’ll walk through the process of creating a policy to monitor the RTO and RPO metrics, and
integrating this policy into the Always On health model. We will create two policies that will have the
following characteristics:
• An RTO policy that is evaluated every 5 minutes and will fail when the estimated
failover time exceeds 10 minutes.
• An RPO policy that is evaluated every 30 minutes and will fail when the estimated
data loss exceeds 1 hour.
• The policy is evaluated only if the local availability group replica is the primary
replica.
• Policy failures are displayed on the primary replica’s dashboard.
If you have used PBM before, you may already know that the process for creating and managing policies includes four steps:
1. Select a facet and configure its properties.
2. Set the condition that specifies the state of the facet.
3. Create a policy with the condition/s that you have created, the targets that are to
be evaluated with the condition, and its evaluation mode.
4. Execute the policy to verify if the SQL Server instance is in compliance with the
policy.
Follow the step-by-step instructions below on all the SQL Server instances that are participating in your
Always On Availability Group environment.
Once you find a facet that seems to describe what you are looking for, right-click it and select Properties.
For our example, we will be using the Availability Group and Database Replica State facets.
■ Note For more information on all the properties of Availability Group facet, visit https://msdn.
microsoft.com/en-us/library/microsoft.sqlserver.management.smo.availabilitygroupstate
(v=sql.130).aspx.
For more information on all the properties of Database Replica State facet, visit https://msdn.microsoft.com/
en-us/library/microsoft.sqlserver.management.smo.databasereplicastate(v=sql.130).aspx.
The Availability Group facet has eight different properties as shown in Figure 15-8. One of the properties is LocalReplicaRole, which is used to get the role (primary, secondary, resolving, or unknown) of the local availability replica.
The Database Replica State facet has 36 different properties as shown in Figure 15-9.
Step 2: Set the Condition That Specifies the State of the Facet
Once you have identified the facet and the property you want to create a policy on, the next step is to create
the property condition. You will create the following three conditions:
1. Check whether the local availability replica for a given availability group is the primary replica.
2. Check if the potential failover time exceeds 10 minutes, including a 60-second overhead for both failure detection and failover.
3. Check if the potential data loss exceeds 1 hour.
To create the first condition, right-click Availability Group facet and select New Condition as shown in
Figure 15-10.
Figure 15-10. Creating new condition for the availability group facet
Enter the following specifications to create the condition and click OK:
• Name: IsPrimaryReplica
• Field: @LocalReplicaRole
• Operator: =
• Value: Primary
Figure 15-11 shows the condition that will check whether the local availability replica for a given
availability group is the primary replica.
Figure 15-11. Condition to check if the role for the local replica is primary
Next, you will create the second condition. To do so, right-click the Database Replica State facet, select New Condition, and use the following specifications:
• Name: RTO
• Field: Add(@EstimatedRecoveryTime, 60)
• Operator: <=
• Value: 600
This condition as shown in Figure 15-12 will fail when the potential failover time exceeds 600 seconds
(i.e., 10 minutes), including a 60-second overhead for both failure detection and failover.
Next, you will create the third condition. To do so, right-click the Database Replica State facet, select New Condition, and use the following specifications:
• Name: RPO
• Field: @EstimatedDataLoss
• Operator: <=
• Value: 3600
This condition as shown in Figure 15-13 will fail when the potential data loss exceeds 3600 seconds
(i.e., 1 hour).
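It can help to see where these numbers come from before building the policies. The @EstimatedRecoveryTime and @EstimatedDataLoss facet fields are surfaced from the database replica DMVs. The following query is only a rough sketch of the same calculations (the dashboard derives the values slightly differently; the queue columns are in KB and the rate columns in KB/sec):
-- Rough equivalents of the two conditions, run on the primary replica:
-- estimated failover time = redo queue / redo rate + 60s overhead (vs. 600s)
-- estimated data loss     = log send queue / log send rate      (vs. 3600s)
SELECT DB_NAME(database_id) AS database_name,
       (redo_queue_size * 1.0 / NULLIF(redo_rate, 0)) + 60
           AS estimated_failover_time_sec,
       log_send_queue_size * 1.0 / NULLIF(log_send_rate, 0)
           AS estimated_data_loss_sec
FROM sys.dm_hadr_database_replica_states
WHERE is_local = 0;  -- rows describing the secondary replicas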
Step 3: Create a Policy with the Conditions That You Have Created
To create the first policy, right-click the Database Replica State facet and select New Policy as shown in Figure 15-14.
Figure 15-14. Creating new policy for the Database Replica State facet
Use the following specifications to create the RTOPolicy:
General Page:
• Name: RTOPolicy
• Check condition: RTO
• Against targets: Every DatabaseReplicaState in IsPrimaryReplica AvailabilityGroup
• Evaluation mode: On schedule
• Schedule: CollectorSchedule_Every_5_min
• Enabled: checked
The general page for the RTOPolicy is shown in Figure 15-15.
Description Page:
• Category: Availability database warnings. This setting will display the policy
evaluation results on the Always On dashboard.
• Description: The current availability replica has a Recovery Time Objective (RTO)
that exceeded 10 minutes, assuming an overhead of one minute for failure detection
and failover. Troubleshoot performance issues on the respective SQL Server instance
immediately.
• Text to display: Recovery Time Objective (RTO) exceeded!
The description page for the RTOPolicy is shown in Figure 15-16.
To create the second policy, right-click the Database Replica State facet, select New Policy, and use the following specifications to create the RPOPolicy:
General Page:
• Name: RPOPolicy
• Check condition: RPO
• Against targets: Every DatabaseReplicaState in IsPrimaryReplica AvailabilityGroup
• Evaluation mode: On schedule
• Schedule: CollectorSchedule_Every_30_min
• Enabled: checked
The general page for the RPOPolicy is shown in Figure 15-17.
Description Page:
• Category: Availability database warnings.
• Description: The availability database has exceeded your Recovery Point Objective
(RPO) of 1 hour. Troubleshoot performance issues on the availability replicas
immediately.
• Text to display: Recovery Point Objective (RPO) exceeded!
The description page for the RPOPolicy is shown in Figure 15-18.
After completing the above three steps, you will have three new conditions, two new policies, and two new SQL Server Agent jobs (one for each policy evaluation schedule), as shown in Figure 15-19. These jobs have names that start with syspolicy_check_schedule.
Figure 15-19. Reviewing the conditions, policies, and jobs created for monitoring RTO and RPO metrics
Figure 15-20. Enabling user-defined Always On Policy checkbox to display the messages on the dashboard
Now the Always On dashboard will pick up our RTOPolicy and RPOPolicy.
■ Tip If the Always On dashboard is already open, then refresh the dashboard or close the dashboard and
reopen it to display the new policies.
Click on the Warnings (1) hyperlink to display the issue details as shown in Figure 15-22.
Figure 15-22. Reviewing the detailed results for a failed policy in Always On dashboard
Using Extended Events
SQL Server ships with a built-in extended event session named AlwaysOn_health. This session is preconfigured to capture a subset of the Always On-related events, and it is automatically started on every participating availability replica if you created the availability group using the New Availability Group wizard. The green arrow next to the AlwaysOn_health session indicates that it is currently running.
■ Note If you created the availability group using some other method (for example, T-SQL or PowerShell), the AlwaysOn_health extended event session may not be started automatically. To start the AlwaysOn_health session and configure it to start automatically, you can use the GUI or execute a T-SQL script like the one below.
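A minimal sketch of such a script follows; it uses the standard ALTER EVENT SESSION syntax against the built-in session:
-- Configure the built-in AlwaysOn_health session to start automatically
-- whenever the SQL Server instance starts.
ALTER EVENT SESSION AlwaysOn_health ON SERVER
    WITH (STARTUP_STATE = ON);
GO
-- Start the session now.
ALTER EVENT SESSION AlwaysOn_health ON SERVER
    STATE = START;
GO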
To view all the Always On extended events along with a brief description, execute the below T-SQL statement:
SELECT name, description FROM sys.dm_xe_objects WHERE name LIKE '%hadr%' ORDER BY name;
There are other ways to view the data that is captured from an extended event session. In SSMS Object Explorer, right-click the target node under the event session node and select View Target Data as shown below in Figure 15-28. This opens the Always On Health Events screen, but the display is not updated as new data is reported by the events. You can click View Target Data again to get the latest data.
Figure 15-28. Viewing AlwaysOn_health extended event_file using SSMS Object Explorer
If you want to watch the data as it continues to arrive in real time, you can right-click the event session
node and select Watch Live Data as shown below in Figure 15-29.
Figure 15-29. Viewing live AlwaysOn_health extended events data using SSMS Object Explorer
If you want to view the AlwaysOn_health extended events using T-SQL, you can use the function sys.fn_xe_file_target_read_file as shown below. You can click a cell in the event_data_XML column, or copy the long XML string from the event_data column, paste it into a text editor (for example, Notepad), save the file with the extension .xml, and then open the .xml file in a browser.
SELECT
object_name,
file_name,
file_offset,
event_data,
CAST(event_data AS XML) AS [event_data_XML]
FROM
sys.fn_xe_file_target_read_file(
'ENTER FULL PATH OF YOUR AlwaysOn_health event file name',
null, null, null
);
Using PowerShell
SQL Server 2016 Always On Availability Groups provides a set of PowerShell cmdlets that enable us to
deploy, manage, and monitor availability groups, availability replicas, and availability databases. The
following PowerShell cmdlets are available to monitor the health of availability groups:
Test-SqlAvailabilityGroup – assesses the health of an availability group by evaluating SQL Server PBM policies. As an example, the below command shows all availability groups with a health state of Error on the server instance AlwaysOnN1\Default:
Get-ChildItem SQLSERVER:\Sql\AlwaysOnN1\Default\AvailabilityGroups | Test-SqlAvailabilityGroup | Where-Object { $_.HealthState -eq "Error" }
Test-SqlAvailabilityReplica – assesses the health of availability replicas.
Test-SqlDatabaseReplicaState – assesses the health of availability databases. As an example, the below command assesses the health of every availability database in the availability group AGCorp:
Get-ChildItem SQLSERVER:\Sql\AlwaysOnN1\Default\AvailabilityGroups\AGCorp\DatabaseReplicaStates | Test-SqlDatabaseReplicaState
Figure 15-30 shows the PowerShell cmdlets used to monitor the health of the availability group AGCorp
and the corresponding results.
Using Alerts
You can use the in-built alerts feature in SQL Server Agent to monitor availability groups. As you might be aware, events are generated by SQL Server and entered into the Windows application log. SQL Server Agent reads the application log and compares the events in the log with the alerts that you create. When a match is found, it fires an alert, which is basically an automated response to an event. SQL Server Agent can also monitor performance conditions and Windows Management Instrumentation (WMI) events.
You can create the following alerts to monitor the health of the availability groups.
• Alerts over Always On errors – To get a complete list of Always On system error messages, query the sys.messages catalog view (see the first sketch after this list). The query returns 17 rows in SQL Server 2016.
• Alerts over other relevant errors – You can create alerts on non-Always On errors too
if they are relevant and need to be monitored to ensure the health of the availability
group. To get a complete list of errors that can be monitored for an availability group,
go to the properties of AlwaysOn_health extended events session, click on Events,
click Configure and select the event error_reported.
• Alerts over Always On Perfmon counters – We discussed some of the important
Performance monitor counters earlier in this chapter. You can create alerts to
monitor these counters.
• Alerts over custom conditions – You can create alerts for specific conditions that need to be evaluated. For example, to get an alert if your recovery time objective (RTO) will be exceeded, you can create a SQL Server Agent job that divides Redo Queue Size by Redo Rate and throws an error if the result exceeds the RTO value. You can then configure an alert to capture these errors and notify the operators accordingly.
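For the first bullet, a sketch of the error-message listing follows. The text-based filter on sys.messages is an assumption, so your result may differ from the 17 rows mentioned above; adjust the predicate as needed:
-- English-language Always On error messages that are written to the
-- Windows application log (candidates for SQL Server Agent alerts).
SELECT message_id, severity, text
FROM sys.messages
WHERE language_id = 1033
  AND is_event_logged = 1
  AND text LIKE N'%availability group%';
The note that follows refers to a script that captures two counters from the SQLServer:Database Replica performance object. A sketch of such a script is shown here (for a named instance, object_name begins with MSSQL$<instance name> rather than SQLServer):
-- Capture the commit-delay counters for each availability database.
SELECT instance_name AS database_name,
       counter_name,
       cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%:Database Replica%'
  AND counter_name IN (N'Transaction Delay', N'Mirrored Write Transactions/sec');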
■ Note The preceding script captures two Performance Monitor counters from the SQLServer:Database Replica object. The first one is Transaction Delay, which shows the wait time for unterminated commit acknowledgment (in milliseconds). The second one is Mirrored Write Transactions/sec, which shows the number of transactions that are waiting to be hardened to the primary because synchronous availability mode requires that they also harden at the secondary. Dividing the first counter by the second tells us by how many milliseconds the transactions on the primary replica database are being delayed by one or more synchronous-commit secondary replicas.
Using System Center Operations Manager (SCOM)
The SQL Server management pack for System Center Operations Manager (SCOM) provides monitoring knowledge with embedded inline tasks, and views that enable near real-time diagnosis and resolution of detected issues. Here are some of the monitoring features that are relevant to availability groups:
• Easily track your Always On Availability Groups inventory by automatically
discovering availability groups, availability replicas and availability databases from
hundreds of computers.
• Enable faster resolution to problems by utilizing SCOM’s alerting and ticketing
feature.
• Create custom tasks to manage availability groups from the SCOM console.
• Easily roll up health from availability databases to availability replicas.
Summary
In this chapter, we reviewed multiple ways to monitor the health and synchronization status of Always On Availability Groups. It's good to know about all the in-built SQL Server monitoring tools that you have at your disposal. Sometimes you may need only one tool; at other times you may have to use multiple tools to gather as much data as possible to troubleshoot a problem. In the next chapter, we will discuss troubleshooting techniques and review some common troubleshooting scenarios that we see at our customer sites in the field.
CHAPTER 16
Troubleshooting Availability
Groups
This chapter discusses how to use different logs to troubleshoot errors encountered during availability group
setup and usage. You will also learn how to troubleshoot some of the common availability group failures and
performance issues.
An unplanned failover of an availability group can happen for various reasons, and the cause is usually outside of SQL Server. SQL Server is just an application that uses external resources to perform its duties, and the availability group is one of its components. Hence, unless the error message clearly indicates that the issue is due to SQL Server, start troubleshooting in the following order:
1. Hardware
2. Operating system
3. Network
4. Security
5. Windows Server failover cluster
6. SQL Server
When an outage occurs, the first step is to define what actually went wrong. Ask the following questions:
• What resource(s) failed?
• Was it the disk? Or was it the network?
• Was it the SQL Server? Or was it the SQL Server Agent?
• Was it the file share?
• Were the users not able to connect even when the SQL Server instance was online?
• Did the problem occur on one node or all nodes?
To determine the root cause of the failure, determine exactly what failed, on which nodes, and what
actions the cluster performed to try to resolve the problem.
Always On Dashboard
For a quick overview of the availability group, look at the Always On dashboard. It provides information on
the availability group states and other performance indicators allowing you to easily make high availability
operational decisions using the following types of information:
• Replica state
• Synchronization mode and state
• Estimated Data Loss
• Estimated Recovery Time (redo catch-up)
• Database Replica details
• Time to restore log
• And more…
This is a centralized report and the replica status can be viewed from the primary replica itself. You do
not need to go to each replica to view the state of that replica.
Figure 16-1 shows the Always On dashboard in the SSMS GUI on the primary replica instance. As you can see, there is a lot of useful information on the dashboard. Important information and report links have been highlighted.
More columns can be added to the dashboard results by clicking the Add/Remove Columns option above the column header and then selecting the columns to add. Figure 16-2 shows the options available for the first result set on the dashboard. The columns highlighted in red are some important columns that you may want to consider adding to the dashboard.
Figure 16-2. Columns available to add to the first result set in the Always On Dashboard
Figure 16-3 shows the options available for the second result set on the dashboard. The columns
highlighted in red are some important columns that you may want to consider adding to the dashboard.
Figure 16-3. Columns available to add to the second result set in the Always On Dashboard
DMVs
There are various DMVs and catalog views (with names that start with sys.dm_hadr_ and sys.availability), which expose a lot of useful information related to availability group troubleshooting. Here are some common ones:
• sys.dm_hadr_availability_group_states - Displays the states that define the health
of a given availability group
• sys.dm_hadr_availability_replica_states – Displays information about the state of a
given availability replica
• sys.dm_hadr_database_replica_states - Returns a row for each database that is
participating in an availability group
• sys.dm_hadr_cluster – Displays information about the cluster and the quorum
• sys.dm_hadr_cluster_members - Returns a row for each of the members that
constitute the quorum and the state of each of them
• sys.availability_groups_cluster – Displays information on the availability group
metadata from the WSFC cluster
• sys.availability_replicas - Returns a row for each of the availability replicas that
belong to any availability group in the WSFC failover cluster
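As a quick illustration of how these views join together, the following sketch (using only documented columns) rolls up the current role and synchronization health of every replica in every availability group:
-- One row per replica: its role, operational state, and health.
SELECT ag.name AS availability_group,
       ar.replica_server_name,
       ars.role_desc,
       ars.operational_state_desc,
       ars.synchronization_health_desc
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar
    ON ars.replica_id = ar.replica_id
JOIN sys.availability_groups AS ag
    ON ars.group_id = ag.group_id;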
Extended Event Logs
The AlwaysOn_health extended event session, discussed in Chapter 15, is also a useful source of troubleshooting data.
■ Note If you created the availability group using the wizard, then this session is automatically started on every participating availability replica. If the availability group wasn't created using the wizard, then this session should be started manually.
Event Logs
The Cluster Service adds messages about the resource status to the cluster log and the system-event log. The
system-event log is generally easier to read than the cluster logs. You can usually identify failed resources by
searching for event ID 1069 in the system-event log.
Cluster Log
If the system-event log has no information about the cause of the failure (for example, the system stops
writing events to the log), use the cluster log. The Cluster.log files can be found in the <systemroot>\cluster\
reports directory (usually c:\windows\cluster\Reports) on each node. Get-ClusterLog is the Windows
PowerShell cmdlet that will generate the cluster.log on each server that is a member of the cluster and is
currently running.
■ Note The cluster log is verbose and complex. It should be the last place you go, not the first.
■ Note Cluster log timestamps are in GMT, while event log timestamps are in local time.
Use NET HELPMSG or ERR.EXE to decipher error codes. Start at the bottom and work your way upward, searching for the following:
• [ERR]
• -->failed – resource failure
• -->failed – group failure
Figure 16-6 shows the net helpmsg command to help decipher error codes logged in the cluster log.
Figure 16-8 shows the results from the preceding SQL query.
• In SQL 2012 and SQL 2014, as GMSAs are not supported, configure the replica
SQL instances to run under the same domain user account. This user account will
be automatically added to the master database in each SQL instance and granted
CONNECT permissions. This will simplify the security configuration.
• If you need to run both SQL instances under different domain accounts for any reason, then you have to add each domain account as a login under the other SQL instance. You also need to manually grant this login CONNECT permissions to the database mirroring endpoint of the other SQL Server instance.
• If you are using non-domain accounts (local accounts or SIDs) as SQL service
startup accounts, you must use certificates for the ENDPOINT authentication.
• If you are using domain-independent availability groups in SQL Server 2016, synchronize the service accounts on all proposed cluster nodes. That means there must be an account with the same name and the same password on each node in the cluster. Certificates need to be used for ENDPOINT authentication.
To resolve this issue, either pre-stage the VCO in the computers OU or grant the following permissions
to the CNO in the computers OU:
• Read all properties
• Create Computer objects
The listener creation may also fail if the same name and/or IP addresses are being used elsewhere in
the same network. Resolve any such conflicts and ensure that you are using a unique network name and IP
addresses for the listener. Please note that only the first 15 characters of the network name are considered for
uniqueness.
Figure 16-12 shows the permissions required for the CNO in the Active Directory.
Failover Troubleshooting
Availability groups can be configured for automatic or manual failover. Automatic failovers are triggered if a
health issue is detected on the primary replica. Not all failovers might work as expected. Let’s take a look at
some common failover scenarios.
■ Note Failover readiness of an availability group can be determined by querying the “is_failover_ready”
column of the sys.dm_hadr_database_replica_cluster_states DMV.
Figure 16-13 shows the failover readiness state of various replicas. In the example below, the AlwaysOnN2 replica is ready for failover; the AlwaysOnN3 replica is not.
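To produce a result like the one in Figure 16-13 yourself, a query along these lines works (a sketch over the documented columns):
-- Which databases on which replicas are synchronized and therefore
-- ready for failover without data loss (is_failover_ready = 1)?
SELECT ar.replica_server_name,
       drcs.database_name,
       drcs.is_failover_ready
FROM sys.dm_hadr_database_replica_cluster_states AS drcs
JOIN sys.availability_replicas AS ar
    ON drcs.replica_id = ar.replica_id
ORDER BY ar.replica_server_name, drcs.database_name;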
■ Note Prior to SQL 2016, only one out of the two synchronous replicas could be configured for automatic
failover. Starting with SQL Server 2016, both synchronous secondary replicas can be configured as automatic
failover partners with the primary replica. Even if one failover partner is lost, high availability can still be
maintained in this case.
■ Note Before you tweak this value, try to fix the underlying cause for the failover. Increasing this value may
only cause the availability group to fail more times before transitioning to a failed state.
Figure 16-15 shows the availability group cluster resource property. In a three-node cluster the
maximum failures in the specified period would be two.
■ Note Information on TCP Chimney Offload and Receive Side Scaling can be found in KB article http://support.microsoft.com/kb/951037.
Also consider tuning the cluster heartbeat settings. There are two settings that
affect cluster heartbeat and health detection between nodes (i.e., delay and
threshold).
Delay defines the frequency at which cluster heartbeats are sent between nodes.
Threshold defines the number of heartbeats that are missed before the cluster
takes recovery action.
Table 16-1 shows the default and maximum values for the delay and threshold parameters for same-subnet, cross-subnet, and cross-site communication in Windows Server 2012 R2 and Windows Server 2016.
Figure 16-18 shows the command to view current heartbeat configuration values.
Figure 16-19 shows the command to change the heartbeat configuration values.
Connectivity Failure
Following are some connectivity scenarios that you may encounter.
■ Note Always On Availability Groups set the listener cluster resource RegisterAllProvidersIP property to 1 by default. .NET Framework 4.6.1 provides faster availability group connections by including “MultiSubnetFailover=True” in connection strings by default.
Read-Only Routing Fails
If read-only routing to a readable secondary is not working, verify the following:
• Connect to the listener. If the listener is configured with a non-default port, specify the port number in the connection string.
• Add the ApplicationIntent=ReadOnly connection parameter in the connection
string.
• Add the “Initial Catalog” or “Database” connection parameter and set its value to
any replica database for that availability group.
• Ensure that the login that you are using in the connection string exists on the
secondary replica instance that the listener will route you to.
• If the availability group is spread over multiple subnets, ensure that the
“MultiSubnetFailover=True” parameter is specified in the connection string.
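If the checklist does not surface the problem, verify the routing configuration on the server side. The following sketch lists, for each replica acting as primary, its routing list and each target's read-only routing URL:
-- Read-only routing configuration, in priority order.
SELECT pri.replica_server_name AS when_this_replica_is_primary,
       rl.routing_priority,
       tgt.replica_server_name AS route_to_replica,
       tgt.read_only_routing_url
FROM sys.availability_read_only_routing_lists AS rl
JOIN sys.availability_replicas AS pri
    ON rl.replica_id = pri.replica_id
JOIN sys.availability_replicas AS tgt
    ON rl.read_only_replica_id = tgt.replica_id
ORDER BY when_this_replica_is_primary, rl.routing_priority;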
Secondary Falling Behind Primary
• There are two useful PerfMon counters, Flow Control/sec and Flow Control Time (ms/sec) (from the SQLServer:Availability Replica object), which show you, within the last second, how many times flow control was activated and how much time was spent waiting on flow control. If the values of these counters are significantly above your baseline values, it may mean that the secondary is overwhelmed with the workload it is processing. In that case, you should troubleshoot the secondary to see if there are any performance bottlenecks. You could also reduce the read workload (if any) on the secondary replica and, if possible, provide additional resources to the secondary.
• If you do not see flow control getting activated, check the primary replica for unusually high activity. Examples of activities that cause a high number of transaction log blocks to be generated and sent to secondary replicas are index rebuilds, bulk operations, and so on. If such activities are running on the primary, they may need to be postponed to off-peak hours. If that is not possible, then additional resources may need to be provided so that the environment can sustain such a high workload. Before doing so, ensure that the high workload on the primary is causing the other secondary replicas to also fall behind the primary replica. If that is not the case, then troubleshoot the secondary replica in question further.
• If the primary replica doesn't show any unusually high activity, troubleshoot the secondary replica in question further with regard to performance bottlenecks.
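The flow control counters mentioned in the first bullet can also be read from T-SQL, which is convenient for capturing a baseline (the exact counter-name casing below is an assumption; check sys.dm_os_performance_counters on your instance):
-- Snapshot the flow control counters for each availability replica.
SELECT instance_name AS replica_name,
       counter_name,
       cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%:Availability Replica%'
  AND counter_name IN (N'Flow Control Time (ms/sec)', N'Flow Control/sec');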
Figure 16-22 shows the workflow for troubleshooting the secondary-falling-behind-primary scenario.
Summary
In this chapter we covered some of the common availability group failures and performance issues and how
to go about troubleshooting them. In the next chapter we will take a look at Microsoft Azure.
PART VII
CHAPTER 17
Introduction to Microsoft Azure
So far in this book, we have discussed deploying SQL Server Always On Availability Groups that run on-premises in your organization. There are situations wherein you are unable to satisfy all the high availability (HA) and disaster recovery (DR) requirements and meet the service-level agreement (SLA) that the business unit requires for the SQL Server applications. In this chapter, we introduce you to Microsoft Azure, which can help you meet all the HADR requirements and stay within budget by deploying Always On Availability Groups as a hybrid solution or an Azure-only solution. In a hybrid environment, part of the availability group solution runs on-premises and the rest runs in Azure. In an Azure-only solution, the entire availability group runs in Microsoft Azure.
■ Note Azure was released as Windows Azure in February 2010 and was renamed Microsoft Azure in
March 2014.
Microsoft Azure offers a collection of services in its public cloud that any organization can leverage for IT solutions such as file servers, database servers, virtual machines, and web apps. A public cloud consists of cloud services that are available to anyone via the public Internet. The services are delivered from Microsoft-managed datacenters located all over the globe. You might be wondering why you need a public cloud when a private cloud can be created relatively easily.
One of the biggest advantages is that Microsoft Azure allows you to perform virtually any compute or
data storage operation by provisioning and scaling the necessary resources on demand and on a pay-as-you-go basis. Microsoft Azure not only allows you to scale your services dynamically to handle increased demand that you may run into unexpectedly, it also lets you scale your services back down when demand shrinks or traffic patterns slow. It allows for immediate provisioning and de-provisioning, and you pay only for what you use.
Microsoft Azure offers a worldwide solution. At the time of writing this chapter, Microsoft Azure has datacenters operating in 24 regions around the world, which is more than the other hyperscale cloud service providers (CSPs), Amazon Web Services (AWS) and Google Cloud, combined. This allows you to deliver your services in a fast, low-latency fashion no matter where your customers are, and exposes the services to users anywhere they need to work from or access the services from.
■ Note Microsoft Azure is the first multinational cloud provider in mainland China.
You may be thinking, "All this is good, but is my data safe in the cloud?" In fact, this is one of the most frequently asked questions by our customers who are thinking of leveraging the cloud. Microsoft Azure leads the industry in protecting the privacy of your data. It was the first CSP to adopt ISO 27018, the international standard for cloud privacy. It was also the first CSP to be recognized by the European Union's data protection authorities for its commitment to comply with rigorous EU privacy laws.
Another common question that we often get is, "We have made a big investment in our on-premises datacenters. Can we still use the resources in our on-premises datacenters and leverage the cloud?" Microsoft Azure is hybrid ready. It can seamlessly connect to your existing IT environment so that you can leverage the resources in your datacenters and within Microsoft Azure itself. You can use a wide set of hybrid connections, for example, virtual private networks (VPNs), ExpressRoute connections, and content delivery networks (CDNs), to ensure performance and ease of use.
Microsoft Azure is an open and flexible CSP. It supports the broadest selection of devices, operating systems, databases, programming languages, frameworks, and tools. It supports the same technologies most businesses trust and use today, including Windows, Linux, virtual machines and containers, and Active Directory. This means you can use what you already know, and there's no steep learning curve with Microsoft Azure.
Microsoft Azure offers enterprise grade SLAs on services, 24/7 tech support, and round-the-clock health
monitoring. This is the reason more than 66% of Fortune 500 companies rely on Microsoft Azure. The web
site https://azure.microsoft.com/en-us/support/legal/sla/ describes Microsoft’s commitments for
uptime and connectivity and lists the SLA for individual services.
■ Note Microsoft Azure is recognized by Gartner, a leading independent research firm, as a leader across six
of their Magic Quadrants for enterprise cloud workloads for the second consecutive year. Visit https://azure.
microsoft.com/en-us/campaigns/magic-quadrant/ to view Gartner’s Magic Quadrant reports.
In a traditional on-premises environment, you manage, own, and are responsible for everything from
networking up through the applications. A private cloud is the first evolution, wherein you automate and own the resources from virtualization down to the networking. This allows you to take the most advantage of the resources you have, control them where you can, and reallocate them in the best possible way.
The next evolution is Infrastructure as a Service (IaaS). With IaaS, you identify the pieces that can be automated and move them over to Microsoft Azure, which takes responsibility for the networking, storage, servers, and virtualization components. In short, IaaS gives you a server in the cloud (a virtual machine). This frees up your time and allows you to work on more important tasks, like making sure that the applications are functioning correctly. You have complete control of, and are responsible for managing, everything from the operating system on up to the application you are running. Some typical scenarios that organizations use IaaS for include testing and development, web site hosting, storage, backup and recovery, web apps, high-performance computing, and big data analysis.
The next evolution is Platform as a Service (PaaS). Here you focus only on the application and data, that is, the things that really generate revenue for the business, and let Microsoft Azure manage the other components from networking up through the runtime. PaaS is a complete development and deployment environment in the cloud that allows you to deliver everything from a simple cloud-based application to a sophisticated cloud-based enterprise application. Like IaaS, PaaS includes infrastructure (storage, servers) and networking, and also the operating system, middleware, development tools, database management systems, business intelligence (BI) services, and more. This allows you to avoid the expense and complexity of buying and managing the infrastructure, software licenses, middleware, development tools, and more. You manage the applications and services, and Microsoft Azure manages everything else. Some typical scenarios that organizations use PaaS for include development frameworks and analytics or business intelligence.
The next evolution is Software as a Service (SaaS), where you let Microsoft Azure manage everything from the networking layer up to the application layer. SaaS allows users to connect to and use cloud-based applications over the Internet. SaaS provides a complete software solution that you can purchase on a pay-as-you-go basis from a CSP. You rent the application for your organization, and your users connect to it over the Internet. This allows your organization to get up and running quickly with an application at minimal upfront cost. Some examples of SaaS are Microsoft Exchange Online, Microsoft SharePoint Online, and Microsoft Dynamics CRM Online.
■ Note You do not have to use these offerings in isolation. You can combine IaaS, PaaS, and SaaS. You can
also combine your on-premises IT infrastructure with Microsoft Azure to create a hybrid solution. This enables
you to meet unique and diverse business needs by combining on-premises and cloud-hosted deployments,
while using the same set of servers, tools, and expertise across these environments.
Microsoft Azure has two options for hosting SQL Server workloads:
• SQL Server on Azure virtual machines (IaaS) – As the name suggests, here SQL Server is installed on Microsoft Azure virtual machines hosted in the cloud. This falls into the industry category IaaS. Microsoft provides an availability SLA of 99.95% that covers just the VM; this SLA does not cover processes such as SQL Server running on the VM, and it requires that you host at least two VM instances in an availability set. For SQL Server HA within the VMs, you should configure one of the supported HA options in SQL Server, such as Always On Availability Groups, as discussed in detail in Chapter 18.
• Azure SQL Database (PaaS) – A relational database-as-a-service offering in which Microsoft manages the platform and you manage only the database itself.
Summary
In this chapter, we introduced you to Microsoft Azure and discussed the three main categories of Microsoft
Azure compute offerings, namely, Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and
Software as a Service (SaaS). We also discussed the options to host SQL Server workloads using Microsoft
Azure. In the next chapter we will help you get started with Microsoft Azure for your Always On Availability
Groups deployment.
CHAPTER 18
Availability Groups in Microsoft Azure
Always On Availability Groups on Microsoft Azure can be created manually or by using a Microsoft Azure Marketplace image. Azure Marketplace provides preconfigured templates for software deployment in Azure. One such template automates the deployment of a SQL Server Always On Availability Group for high availability of SQL Server. It provisions two SQL Server replicas (primary and secondary) and one witness file share in a Windows cluster, and it provisions two domain controller replicas (primary and secondary). In addition, it configures an availability group listener for clients to connect to the primary SQL Server replica.
This template deploys an Always On Availability Group such that after deployment is complete, the user has a fully available availability group. The template implements performance, security, and availability best practices. The template is called SQL Server Always On Cluster. The name is misleading, as it gives the impression that you would be installing a SQL Server failover cluster; in fact, this template deploys Always On Availability Groups for you in Microsoft Azure. The Always On label for the clustering and availability group technologies has already caused a lot of confusion, and calling the availability group deployment template on Azure an Always On cluster just adds to it. Let's walk through the steps for deploying availability groups on Azure using the Marketplace image.
Figure 18-2 shows the SQL Server Always On Cluster template that shows up in the search results.
Resource Group: A resource group is a container that holds related resources for an application. It helps you deploy, manage, and monitor all of the resources for your solution as a group, rather than handling these resources individually. This new availability group deployment will therefore be grouped under a resource group. Later on, you will have the option of specifying a new resource group or an existing resource group.
■ Note Resource Manager is the recommended deployment model for new virtual machines. The SQL
Server Always On Cluster deployment template does not allow you to change the deployment model and uses
resource manager as the default.
Availability Set: During downtime such as maintenance, an availability set helps keep the virtual
machines available. When two or more similarly configured virtual machines are placed in an availability
set, it creates the redundancy needed to maintain availability of the applications or services that the
virtual machine runs. For example, in this availability group deployment, DC1 and DC2 are placed in
one Availability Set, whereas SQL 1, SQL 2 and the File share witness are placed in another. Hence during
maintenance, DC1 and DC2 will not be unavailable at the same time; likewise SQL 1, SQL 2, and FSW will
not be down at the same time.
Internal Load Balancer (ILB): The availability group listener name is mapped to a load-balanced IP
address and Azure’s load balancer directs the incoming traffic to only the primary server in the replica set.
The ILB can only be accessed by resources inside the cloud service or via VPN that connects to the Azure
infrastructure. This enables internal applications to run in Azure and be accessed within the cloud or from
on-premises.
After reviewing the deployment architecture, click on Create.
■ Note The administrator account will be for all the virtual machines. It will also be the domain administrator
for the new domain deployment.
Using a new resource group is helpful if you are just testing or learning about SQL Server deployments in Azure.
After you finish with your test, delete the resource group to automatically delete the VMs and all resources
associated with that resource group.
Click OK.
The domain controllers and the SQL Servers will be on separate subnets. All the VMs in the deployment will be under one virtual network.
■ Note The virtual network address range will be 10.0.0.0/16. The DC subnet address prefix will be
10.0.0.0/24, whereas the SQL subnet address prefix will be 10.0.1.0/26.
Click OK.
A listener will be created with the availability group listener name that you provide.
■ Note If you choose to use port 1433, you do not need to provide a port number in a connection string. If
you choose a port other than 1433 for your listener port, you will also need to explicitly specify a target port in
your connection string.
Click OK.
Click on the SQL Server virtual machine size option. The portal opens the size blade on the right side.
The standard VM sizes consist of several series: A, D, DS, F, Fs, G, and GS.
D-series VMs are designed to run applications that demand higher compute power and temporary disk
performance.
• Faster processors
• Higher memory-to-core ratio
• Solid-state drive (SSD) for the temporary disk
Dv2-series is a follow-on to the original D-series and features a more powerful CPU.
• About 35% faster than the D-series CPU.
• Based on the latest generation 2.4 GHz Intel Xeon® E5-2673 v3 (Haswell) processor
with the Intel Turbo Boost Technology 2.0, which can go up to 3.1 GHz.
• Same memory and disk configurations as the D-series.
F-series provides the best value in price-performance in the Azure portfolio based on the Azure
Compute Unit (ACU) per core
• Based on the 2.4 GHz Intel Xeon® E5-2673 v3 (Haswell) processor, which can achieve
clock speeds as high as 3.1 GHz with the Intel Turbo Boost Technology 2.0.
• Same CPU performance as the Dv2-series of VMs.
G-series
• Most memory
• Run on hosts that have Intel Xeon E5 V3 family processors
DS-series
• DSv2-series, Fs-series, and GS-series VMs can use Premium Storage, which provides
high-performance, low-latency storage for I/O intensive workloads.
• Use solid-state drives (SSDs) to host a virtual machine’s disks and also provide a
local SSD disk cache.
• Premium Storage is available in certain regions.
A-series
• Can be deployed on a variety of hardware types and processors.
• The size is throttled, based upon the hardware, to offer consistent processor
performance for the running instance, regardless of the hardware it is deployed on.
A0
• Size is over-subscribed on the physical hardware.
• Other customer deployments may impact the performance of your running
workload.
■ Note The size of the virtual machine affects the pricing, processing, memory, and storage capacity of the
virtual machine.
The blade initially displays recommended machine sizes. It also estimates the monthly cost to run
the VM. The virtual machine size is the Azure virtual machine size for both SQL Servers. Choose a virtual
machine size appropriate for your workload. For production workloads choose a virtual machine size that
can support the workload. Many production workloads will require DS4 or larger. The template will build
two virtual machines of this size and install SQL Server on each one. Click on the View All button, which
shows all machine size options.
■ Note The Marketplace image will install the Enterprise Edition of SQL Server.
■ Note At the time of writing this book, the Marketplace image provided the option of choosing SQL VMs with only premium disk support. As a result, the only SQL storage type available was Premium locally redundant storage (Premium_LRS). We will discuss all the storage types in detail later in the chapter.
Now click on the Domain controller virtual machine size option. The portal opens the size blade on the
right side.
Like the SQL VM size blade, this blade initially displays recommended machine sizes. It also estimates
the monthly cost to run the VM. Click on the view all button, which shows all machine size options.
Choose the machine size, and then click Select.
Now click on the File Share Witness virtual machine size option. The portal opens the size blade on the
right side.
Figure 18-11 shows the size blade.
Figure 18-11. Choose a size for the File Share Witness VMs
Provide a unique name for the storage account and click on Type. The portal opens the choose storage
type blade.
Figure 18-13 shows the storage account creation blade.
As mentioned earlier, the Marketplace image provided the option of choosing SQL VMs with only premium disk support. As a result, the SQL storage type available is Premium_LRS. However, if you are building the VMs manually, you can choose VMs without premium disk support, which makes the other storage types available.
■ Note The data in the storage account is always replicated to ensure durability and high availability,
meeting the Azure Storage SLA even in the face of transient hardware failures.
Types of storage:
Locally redundant storage (LRS)
Data is replicated within the region in which the storage account is created. The data is replicated three times, and a transaction is considered successful only once it has been written to all three replicas. The three replicas each reside in separate fault domains (FDs) and upgrade domains (UDs), so that data remains available even if a hardware failure impacts a single rack or when nodes are upgraded during a rollout.
■ Note An FD is a group of nodes that represents a physical unit of failure; the nodes in an FD can be thought of as belonging to the same physical rack. A UD is a group of nodes that are upgraded together during the process of a service upgrade.
Geo-redundant storage (GRS)
Data is replicated to a secondary region hundreds of miles away from the primary region, in addition to the three replicas maintained within the primary region.
■ Note Write data is replicated asynchronously to the secondary region. Since asynchronous replication involves a delay, in the event of a regional disaster it is possible that changes that have not yet been replicated to the secondary region may be lost if the data cannot be recovered from the primary region.
Next, choose the storage optimization option for your workload. All the SQL Servers in this deployment use premium storage. In addition, you can optimize the SQL Server settings for the workload by choosing one of these three settings:
• General workload sets no specific configuration settings.
• Transactional processing enables trace flags 1117 and 1118 on the SQL Server instances.
• Data warehousing sets trace flags 1117 and 610 on the SQL Server instances.
Figure 18-15 shows the storage optimization in the VM size and storage blade.
Figure 18-15. Select the storage optimization option for the workload
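The Marketplace image applies these trace flags for you. If you build the VMs manually and want the same behavior, the flags are typically added as SQL Server startup parameters (-T1117, -T1118); for a quick test you can also enable them globally, as in this sketch:
-- Enable the transactional-processing trace flags instance-wide (-1 = global):
-- 1117 grows all files in a filegroup together; 1118 allocates full extents.
DBCC TRACEON (1117, 1118, -1);
-- Verify the current status of the trace flags.
DBCC TRACESTATUS (1117, 1118);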
SQL Server Name Prefix is used to create a name for each SQL Server. If the default contoso-ag is used, then the SQL Server names will be contoso-ag-0 and contoso-ag-1.
SQL Server version is the version of SQL Server that you want to use.
SQL Server service account user name is the domain account name for the SQL Server service.
Password is the password for the SQL Server service account.
SQL Auto Patching maintenance schedule identifies the weekday that Azure will automatically patch the
SQL Servers.
SQL Auto Patching maintenance start hour is the time of day for the Azure region when automatic
patching will begin.
After providing and selecting the required options, click OK.
The portal opens the summary blade as Azure validates the settings.
Figure 18-17 shows the summary page for review.
Figure 18-18. Buy page with terms of use and privacy policy
• If you are not using premium storage, then disable Azure geo-redundant storage
(geo-replication) on the storage account.
• Disks
a. Use at least two premium Storage disks (one for log and the other for data
and tempdb).
b. The temporary storage drive (the D drive) is not persisted to Azure blob storage; hence, user database files and transaction log files should not be stored on the temporary storage drive.
c. If using premium disks, enable read caching on data and TempDB data
disks.
d. Do not enable caching on transaction log disks.
e. Do not enable caching if not using premium disks.
• Stripe multiple Azure data disks to get increased IO throughput.
a. For more throughput, add additional data disks and use disk Striping.
b. Analyze the number of IOPS available for the data and log disks to
determine the number of data disks.
c. For Windows 8/Windows Server 2012 or later, use Storage Spaces.
■ Note Storage Spaces is functionality introduced in Windows Server 2012. It allows provisioning storage based on a pooled model and provides sophisticated storage virtualization capabilities.
Clicking the Add Azure Replica button opens the Add Azure Replica page.
Figure 18-20 shows the Add Azure Replica page.
After signing in to the Azure account, select the image and size for the new VM, and provide the name and user credentials for the VM to be created. Also select the virtual network and subnet for Azure connectivity, and provide the on-premises domain name and user credentials.
Summary
The information in this chapter should help you get started with Microsoft Azure for your Always On
Availability Groups deployment. Throughout this book, we have seen the amazing improvements that
have been made to Always On Availability Groups in SQL Server 2016 and the ability to easily deploy it to
Microsoft Azure, providing unparalleled flexibility. We have thoroughly enjoyed writing this book, and we
hope that it has empowered you to put the knowledge gained from this book to practical use.