resiliency_with_cloud_sql_whitepaper
Planned Downtime
Unplanned Downtime
Application considerations for an HA deployment
Resiliency with Cloud SQL
Cloud SQL is Google Cloud’s fully managed relational database service for MySQL, PostgreSQL, and SQL
Server. It provides full compatibility with the source database engines while reducing operations costs by
automating database provisioning, storage capacity management, and other time-consuming tasks. Cloud
SQL has built-in features to ensure business continuity with reliable and secure services, backed by a 24/7
SRE team providing a 99.95% SLA for the service.
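To put the 99.95% figure in concrete terms, the short sketch below converts an availability percentage into a downtime budget. It is plain arithmetic only: the function name and the 30-day period are illustrative assumptions, and what actually counts as downtime is defined by the Cloud SQL SLA itself, not by this calculation.

```python
def downtime_budget_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime per period that still meet the availability target.

    Illustrative helper (not part of any Cloud SQL API); period_days=30 is an
    assumed billing-month length.
    """
    period_minutes = period_days * 24 * 60
    return period_minutes * (100.0 - sla_percent) / 100.0


# A 99.95% target over a 30-day month leaves roughly 21.6 minutes.
print(round(downtime_budget_minutes(99.95), 1))
```

The same function shows how quickly the budget shrinks as the target rises, which is why the planned and unplanned events described in this guide both matter.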
Cloud SQL acts as the database backend for critical business applications deployed by enterprises
across the globe. Many of these applications need to be available 24/7, and a key component of these
applications is the database storing and managing the data used by the application. Ensuring the
availability of these applications and the underlying database is important because downtime results in
lost revenue, unhappy customers, and potential damage to an organization’s reputation.
This in-depth guide discusses the availability features of Cloud SQL and how the service handles planned
maintenance events and unplanned outages, including the advantages and controls available to help you
minimize application downtime.
We will cover the availability of single-zone Cloud SQL instances and the advantages of a regional HA Cloud
SQL deployment, and then present an architecture to address disaster recovery requirements.
Each architecture builds on top of the previous deployment method while improving the availability
of the Cloud SQL service. In other words, the Cloud SQL HA architecture provides the capabilities of
a single-zone Cloud SQL instance deployment, plus additional availability characteristics. We will also
briefly discuss the requirements of the applications to manage various outage events.
Note: There may be some differences in the capabilities of the individual database engines in Cloud SQL.
Please check the Cloud SQL documentation for the specific database engine for details.
Factors impacting database availability
All IT systems, including databases, can be subject
to unforeseen failures. Sometimes, they need to be
taken out of service for planned events like software
upgrades, security patches, and hardware and
firmware updates. A highly available deployment
architecture should ensure data protection that
provides zero to low RPO (Recovery Point Objective)
and low RTO (Recovery Time Objective) to ensure a
fast return to service in case of any type of outage.
There are two types of events that impact availability: planned downtime due to maintenance or other activities
and unplanned downtime due to various outage scenarios.
A planned outage or maintenance downtime is a proactive set of scheduled tasks carried out to improve
the database availability, performance, or the security of your Cloud SQL instance or its underlying operating
system. This is a scheduled event, and therefore, the impact on availability can be mitigated and controlled.
An unplanned outage, or unplanned downtime, is not scheduled and results from a component or system failure,
software bugs, or even human error. The database deployment architecture needs to be designed to handle
these outages.
Planned downtime
The planned events that can impact the availability
of a Cloud SQL instance fall into two categories:
Configuration updates
Scaling an instance up or down in response to changing workload profiles and setting database flags for the
respective database engines are types of planned events that you control and schedule, usually during a
period of low user activity.
A Cloud SQL instance has three main components that may need to be scaled: storage, CPU, and RAM.
Cloud SQL storage can be set to increase dynamically, which means there is no downtime when storage is
added to the instance. If the available storage falls below a threshold size, Cloud SQL automatically adds
storage capacity to your instance, up to a limit of 64 TB. Having a database that grows as needed minimizes
application downtime by reducing the risk of running out of database space. You can take the guesswork out of
capacity sizing without incurring any downtime or performing database maintenance.
While storage size can be increased automatically without downtime, it cannot be decreased. The storage
increases are permanent for the life of the instance. There are methods to reduce the size of the database
(thereby reducing storage), which could incur some downtime. An example might include migrating to a new
instance using our Database Migration Service (DMS) with less storage allocated to the target instance.
Maintenance events
The second essential category of planned maintenance is software
maintenance. Given Cloud SQL is a managed service, it automatically
updates instances from time to time to ensure that the underlying
hardware, operating system, and database engine are reliable,
performant, secure, and up to date. We perform many of these
updates while the Cloud SQL instance is up and running.
To help, Cloud SQL has a unique feature that enables you to set a deny maintenance period during these
times. Setting up deny maintenance periods, which come in blocks of up to 90 days, prevents Cloud SQL
from performing automatic maintenance during a deny period. When you configure a deny maintenance
period on your primary instance, maintenance for all replicas associated with the primary instance is
also denied.
Note: In very rare cases, Cloud SQL might need to schedule maintenance outside of the maintenance
settings to patch severe stability issues or time-sensitive vulnerabilities. These updates roll out
rapidly, and Cloud SQL counts them as downtime against the SLA.
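Because deny maintenance periods are limited to blocks of up to 90 days, it can help to check a planned freeze window before configuring it. The sketch below is a local date calculation only: the function names are hypothetical, it does not call any Cloud SQL API, and counting both the start and end dates toward the limit is an assumption to confirm against the Cloud SQL documentation.

```python
from datetime import date

MAX_DENY_PERIOD_DAYS = 90  # limit stated in the text above


def deny_period_length(start: date, end: date) -> int:
    """Calendar days covered by the period, counting both endpoints (assumption)."""
    if end < start:
        raise ValueError("end date is before start date")
    return (end - start).days + 1


def is_valid_deny_period(start: date, end: date) -> bool:
    """True if the planned period fits within the 90-day limit."""
    return deny_period_length(start, end) <= MAX_DENY_PERIOD_DAYS
```

For example, a holiday freeze from 1 November to 15 January covers 76 days under this counting and fits in a single deny period, whereas a six-month freeze does not.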
Above is a screenshot of what the maintenance settings could look like for a customer environment.
It’s also important to consider the application’s behavior during maintenance windows. Although the downtime
during maintenance is very low, applications should be built to handle temporary errors. You should leverage
techniques like proxies or connection pooling to minimize the application impact of dropped connections to the
database. Also, applications should have error handling and retry logic with exponential backoff built in to handle
any query failures or connection drops during maintenance.
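The retry logic with exponential backoff described above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: `TransientDatabaseError` and `run_with_retry` are hypothetical names, and in a real application you would catch the specific connection or query exceptions raised by your database driver. Randomizing the wait ("jitter") is an added design choice so that many clients do not reconnect at the same instant.

```python
import random
import time


class TransientDatabaseError(Exception):
    """Hypothetical stand-in for a driver's dropped-connection or failed-query error."""


def run_with_retry(operation, max_attempts=5, base_delay=0.5, max_delay=30.0,
                   sleep=time.sleep):
    """Call operation(); on a transient error, wait and try again.

    The upper bound on the wait doubles after each failed attempt
    (exponential backoff) and is capped at max_delay. The actual wait is a
    random value between 0 and that bound.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientDatabaseError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

A caller wraps each unit of database work, for example `run_with_retry(lambda: execute_query(conn, sql))`, where `execute_query` is whatever function the application already uses. Only idempotent operations, or operations wrapped in a transaction, should be retried this way.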
For more details, we recommend reading this article about how Cloud SQL maintenance works.
Unplanned downtime
While downtime related to maintenance events can be
controlled, unplanned outages can occur due to a variety of
reasons ranging from hardware failures and software bugs
to human error. More severe types of outages could be
the loss of an entire region or data center due to a natural
disaster. It’s vital to minimize the downtime associated with
these outages and recover with zero or minimal data loss.
The table below lists the high availability Cloud SQL solution
for various outage types for unplanned downtime, which we
will describe in more detail later.
Live Migration
(managed by Google Cloud Operations)
In a typical data center environment, any updates to the underlying physical infrastructure like swapping out a
defective machine, replacing an old or failing disk, performing BIOS updates, or similar have the potential to bring
down the instance undergoing hardware maintenance.
However, Google Cloud performs hardware updates without interruption to a user’s application. For example, when
updating a database server, Google Cloud uses live migration—an advanced technology that reliably migrates
a virtual machine (VM) from the original host to a new one while the VM stays running. There may be a short
brownout period during the live migration, and the application should be coded to handle errors and retry.
Live Migration is done on a best-effort basis. If hardware fails completely, or otherwise prevents live migration, the
VM automatically crashes and restarts.
Backup and recovery
In the event of human error or data corruption,
Cloud SQL backups protect your data
from loss or damage. While backups are
a foundational element of an availability
strategy, it is important to test recovery using
the backups that have been taken.
1. Automated backups
2. On-demand backups
Below are some of the scenarios where you can leverage Backup
and Recovery to recover from a failure:
Cloud SQL High Availability
The Cloud SQL High Availability configuration
provides next-level availability for Cloud SQL
instances. It builds on top of our foundational
availability features described above.
[Figure: Cloud SQL HA architecture in Region 1. Zone A holds the primary Cloud SQL instance with
persistent disk 01; Zone B holds the standby Cloud SQL instance with persistent disk 02. The two
disks make up a regional persistent disk, and the client application connects through an IP address.]
The primary instance becomes less loaded when you offload read
workloads to a replica, keeping the primary instance more stable. You
can also promote a replica to become the primary instance if the original
is corrupted. However, this can lead to some data loss depending on the
replication lag between the primary instance and the replica.
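Since the potential data loss on promotion tracks the replication lag, one way to reason about it is to compare each replica's lag with your RPO before choosing which replica to promote. The sketch below assumes you already have a lag measurement per replica; how lag is obtained differs by database engine and is not shown. `ReplicaStatus` and `promotion_candidates` are hypothetical names, not Cloud SQL APIs.

```python
from dataclasses import dataclass


@dataclass
class ReplicaStatus:
    name: str
    lag_seconds: float  # how far the replica trails the primary (measured elsewhere)


def promotion_candidates(replicas, rpo_seconds):
    """Return replicas whose lag is within the RPO, least-lagged first."""
    within_rpo = [r for r in replicas if r.lag_seconds <= rpo_seconds]
    return sorted(within_rpo, key=lambda r: r.lag_seconds)
```

An empty result means that promoting any replica right now would lose more data than the RPO allows, which is a signal to revisit the replica placement or the RPO target.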
It’s possible to create a replica in one, two, or all three of these locations
if needed. We recommend putting your replicas in a different zone. This
ensures that replicas will continue to operate even if there is an outage in
the zone that contains the primary instance.
Cloud SQL disaster recovery
The Cloud SQL HA configuration provides protection against zonal failures. However, a failure that
affects an entire region, typically due to a natural disaster or multiple catastrophic failures,
could render both the primary and standby instances unavailable.
To recover from this type of failure, you can set up cross-region read replicas in a different
region from where a primary is located.
Below is an example of a Cloud SQL DR schematic:
[Figure: Cloud SQL DR schematic. Region A contains Zone A and Zone B; Region B contains Zone C;
asynchronous replication runs between the regions.]
Application considerations for an HA deployment
[Figure: architecture diagram. Recoverable labels: Cloud Functions, Clients, Static IP Address,
Active, Standby instance, Read Replica, Synchronous replication.]