
IT Alert Operations: Standard Operating Procedure

This standard operating procedure document describes the alert operations for an RDSProxy back-end and other elements. It provides descriptions, dashboard links, alert definitions, symptoms and recovery processes. For the RDSProxy back-end, the alert is triggered when 3 health checks fail, notifying MySQL administrators. If unacknowledged for 30 minutes, senior DBAs are escalated. The recovery process involves manually checking and re-enabling the backend host in HAProxy.

IT Alert Operations

STANDARD OPERATING PROCEDURE

Author
[COMPANY NAME] | [COMPANY ADDRESS]
9/24/2019 1:40:00 AM
TABLE OF CONTENTS
RDSProxy Back-end
  Description
    Dashboard Links
  Alert Definition
    State: Down
  Symptoms
  Recovery Process
[Element]
  Description
    Dashboard Links
  Alert Definition
    State [Warning/Critical/Down/Unreachable]
  Symptoms
  Recovery Process
Version Date Editor

RDSPROXY BACK-END

DESCRIPTION

RDSProxy is an instance running HAProxy in TCP proxy mode: it binds a locally listening socket to a
remote socket on a “back-end” host and then steps aside, so that the native transmission occurs on the wire.
HAProxy monitors the back-end RDS instances by making a MySQL client connection to each of them as the
haproxy_check user.

DASHBOARD LINKS
RDSProxy Dashboard
MySQL Dashboard

ALERT DEFINITION

STATE: DOWN

TRIGGER
If 3 consecutive health checks fail for a given back-end RDS instance, HAProxy marks that instance
"down" and sends no further traffic to it. Once a back-end server is marked down, HAProxy will not attempt
to re-enable it; you must do this manually.
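
The trigger rule can be illustrated with a small shell sketch. This is not HAProxy's own code: backend_state is a hypothetical helper that only models the behavior described in this procedure, assuming a threshold of 3 consecutive failures and no automatic recovery once an instance is marked down.

```shell
#!/usr/bin/env bash
# Hypothetical model of the trigger rule in this SOP (not part of HAProxy).
# Usage: backend_state pass fail fail ...   (one argument per health check)
backend_state() {
  local threshold=3 failures=0 result
  for result in "$@"; do
    if [ "$result" = "fail" ]; then
      failures=$((failures + 1))
    else
      failures=0   # a passing check resets the consecutive-failure count
    fi
    if [ "$failures" -ge "$threshold" ]; then
      # Assumption from this SOP: once down, later passes do not
      # bring the instance back; it must be re-enabled manually.
      echo "down"
      return 0
    fi
  done
  echo "up"
}
```

For example, `backend_state pass fail fail pass fail` reports "up" because the failures are never three in a row, while `backend_state fail fail fail pass` reports "down".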

NOTIFICATION
Team: MySQL Administrators

ESCALATION
If two or more systems alert with this message, escalate immediately to Senior DBAs.
If the alert is not acknowledged or resolved within 30 minutes, escalate the notification to Senior DBAs.
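
The two escalation rules can be expressed as a single decision, sketched below. The function name escalation_target and its two numeric arguments are hypothetical, and treating exactly 30 minutes as the point of escalation is an assumption about how the 30-minute limit is read.

```shell
#!/usr/bin/env bash
# Hypothetical helper modelling the escalation rules in this SOP.
# Usage: escalation_target ALERTING_SYSTEMS MINUTES_UNACKNOWLEDGED
escalation_target() {
  local systems=$1 minutes=$2
  # Escalate if two or more systems are alerting, or if the alert has
  # gone 30 minutes (assumed inclusive) without acknowledgement.
  if [ "$systems" -ge 2 ] || [ "$minutes" -ge 30 ]; then
    echo "Senior DBAs"
  else
    echo "MySQL Administrators"
  fi
}
```

With one alerting system and 10 minutes elapsed, the alert stays with the MySQL Administrators; with two alerting systems, or after 30 minutes, it goes to the Senior DBAs.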

RESET CONDITION
Health check reports status as “up”.

SYMPTOMS

Remote calls to the affected instance may be slow to return results. Failures of multiple RDSProxy back-ends
will degrade performance.

RECOVERY PROCESS

Run the HAProxy health check manually:

mysql --user=haproxy_check --host=ha_host

Manually re-enable the back-end host:

hactl enable server ha_host/ha_service health up


hactl set server ha_host/ha_service health up

Follow the haproxy.log file to confirm that the back-end host is back in service:

tail -f /var/log/haproxy.log
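
Instead of watching the log by eye, the last recorded state of the back-end can be extracted with a short function. This is a sketch only: last_server_state is a hypothetical helper, and the "Server NAME is UP" / "Server NAME is DOWN" wording is an assumption about the log format that should be verified against the actual contents of haproxy.log.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the most recent state logged for a server.
# Assumes state changes appear as "Server <name> is UP" or
# "Server <name> is DOWN"; check this against your own haproxy.log.
# Usage: last_server_state ha_host/ha_service < /var/log/haproxy.log
last_server_state() {
  local server=$1 line state="unknown"
  while IFS= read -r line; do
    case "$line" in
      *"Server $server is UP"*)   state="up" ;;
      *"Server $server is DOWN"*) state="down" ;;
    esac
  done
  echo "$state"   # "unknown" if no matching line was seen
}
```

A result of "up" after the re-enable step suggests the back-end has been restored; "down" or "unknown" means the log should be inspected directly.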

[ELEMENT]

DESCRIPTION

Description of the affected element and what it entails. Include a plain-language description of the element and its
role in the organization.

DASHBOARD LINKS
Links to a dashboard for monitoring the service

ALERT DEFINITION

STATE [WARNING/CRITICAL/DOWN/UNREACHABLE]

TRIGGER
Trigger conditions for the above state. Repeat the “State” header with additional statuses if this element can
trigger more than one state.

NOTIFICATION
Who will get the notifications and which transport will be used.

ESCALATION
If there are any escalation paths, define them here.

RESET CONDITION
How do we know that this is resolved?

SYMPTOMS

What are the symptoms seen by IT, end-users, external services, etc.?

RECOVERY PROCESS

How do you recover from this alert? If things are done automatically via the NMS, define them here.
