Allow all auto-disabled webhooks to self-heal, but permanently disable after 40 consecutive failures
About
Currently, webhooks that result in `5xx` errors can self-heal after being auto-disabled, but `4xx` errors lead to permanent disabling.
Webhooks failing with `5xx` errors are disabled for 1 minute on the 4th consecutive failure, then for 2 minutes on the 5th failure, 4 minutes on the 6th, and so on, up to a maximum of 1 day. `4xx` errors are handled differently: on the 4th error the webhook is disabled permanently rather than temporarily.
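For illustration, here is a minimal sketch of that backoff schedule; the constant and method names are ours, not GitLab's actual implementation:

```ruby
MIN_BACKOFF_SECONDS = 60            # 1 minute
MAX_BACKOFF_SECONDS = 24 * 60 * 60  # 1 day
FAILURES_BEFORE_BACKOFF = 3         # backoff starts on the 4th failure

def backoff_seconds(consecutive_failures)
  return 0 if consecutive_failures <= FAILURES_BEFORE_BACKOFF

  # Start at 1 minute and double on each further failure, capped at 1 day.
  exponent = consecutive_failures - FAILURES_BEFORE_BACKOFF - 1
  [MIN_BACKOFF_SECONDS * (2**exponent), MAX_BACKOFF_SECONDS].min
end

backoff_seconds(4)  # => 60     (1 minute)
backoff_seconds(5)  # => 120    (2 minutes)
backoff_seconds(15) # => 86_400 (capped at 1 day)
```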
Problem
However, there are times when a `4xx` error is a temporary state.
Consider that a `404` may happen in normal operation, while we deploy/migrate/initialize: the application/route/ingress may be removed temporarily, and Kubernetes, for instance, will return a `404` in that case. This resolves after a period of time, but any webhooks hitting the server in the meantime will not self-heal.
Another scenario could be a server that temporarily rejects a hook as unauthorized (`403`) due to a bug in how it assesses the Basic Auth credentials. Once the bug on the server is fixed, the webhook should ideally self-heal automatically and start firing again some time later.
Solution
Treat `4xx` errors the same as `5xx` errors, allowing them to self-heal.
Additionally, we will make all failures lead to the webhook being permanently disabled after 40 consecutive failures #503733 (comment 2217234805).
The feature is therefore:
- All webhooks can self-heal
- All webhooks are permanently disabled after 40 consecutive failures
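A hedged sketch of the combined rule follows; apart from `#backoff!`, which the implementation guide below mentions, the method and attribute names here are assumptions:

```ruby
PERMANENT_DISABLE_THRESHOLD = 40

Hook = Struct.new(:consecutive_failures, :permanently_disabled) do
  def backoff!
    # temporary disable with exponential backoff (see the sketch above)
  end

  def permanently_disable!
    self.permanently_disabled = true # needs manual re-enabling afterwards
  end
end

def handle_failure(hook)
  hook.consecutive_failures += 1

  if hook.consecutive_failures >= PERMANENT_DISABLE_THRESHOLD
    hook.permanently_disable!
  else
    hook.backoff! # any failure type, 4xx and 5xx alike, can self-heal
  end
end

hook = Hook.new(0, false)
39.times { handle_failure(hook) } # backs off each time, still recoverable
handle_failure(hook)              # 40th consecutive failure: permanent
hook.permanently_disabled         # => true
```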
Original description:
Also related to #388990 (closed)
The unexpected introduction of Automatic disabling of failing webhooks in 15.7 caused a lot of stress in our automation ecosystem. Suddenly many users were contacting us that automation was broken. It was made worse by the buggy UI, which did not let users easily re-enable the hooks themselves - #391184
I can see how a strict webhook availability policy is necessary in a SaaS environment, and how it causes trouble for self-hosted setups. The two should probably be treated differently. We run GitLab and we run the hooked services, so we police ourselves. I must say it appears rather harsh to break self-hosted environments for the sake of protecting SaaS. I would have expected more careful handling.
Running systems near their peak performance while still working out high availability, occasional downtimes are inevitable, and our expectation is that dependent systems recover. I still think it's a neat idea to inform users that their webhooks are broken, though.
We need GitLab to operate as autonomously as possible. It should self-heal. When the obvious UI issues are addressed, I'd propose that the rules for disabling webhooks be reconsidered. Until then we have to disable this new feature. And since that is not even an option yet (#388486), we had to hack together a script that re-enables the disabled webhooks.
Proposal A: Internal errors are not all the same. If DNS resolution failed, the TCP port was closed, or the IP was unreachable, the problem is likely on the configuration side. If the call failed due to a network error (which may also manifest temporarily as the previous set) or a timeout, then it's a temporary error that should be recovered from. If you can make this distinction, disabling could get smarter.
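A rough sketch of the distinction Proposal A suggests; the exception classes are standard Ruby/`Net::HTTP` ones, but the grouping into "configuration" versus "temporary" is the commenter's idea, not existing GitLab behaviour:

```ruby
require 'net/http'

CONFIGURATION_ERRORS = [
  SocketError,         # DNS resolution failed
  Errno::ECONNREFUSED, # TCP port closed
  Errno::EHOSTUNREACH  # IP unreachable
].freeze

TEMPORARY_ERRORS = [
  Net::OpenTimeout,    # timed out opening the connection
  Net::ReadTimeout,    # timed out reading the response
  Errno::ECONNRESET    # transient network failure
].freeze

def classify_failure(exception)
  return :configuration if CONFIGURATION_ERRORS.any? { |klass| exception.is_a?(klass) }
  return :temporary if TEMPORARY_ERRORS.any? { |klass| exception.is_a?(klass) }

  :unknown
end

classify_failure(SocketError.new)      # => :configuration
classify_failure(Net::ReadTimeout.new) # => :temporary
```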
Proposal B: "Remember" that a given webhook configuration has worked at least once (by setting a "working" flag or a date on the record). If it has, treat subsequent errors as temporary and retry. Or retry with a longer back-off period? When the webhook configuration changes, reset the "working" flag.
Initially I found (A) more adequate, but reviewing it I think it would be a bit unpredictable, or at least hard for users to understand. (B) would be plainly obvious, especially if the "working" flag made it into the UI. Both variants could be implemented, after all.
Proposal B differs from the current implementation in treating all errors the same - 4xx, 5xx, or network/DNS error, all alike. If the webhook worked AT LEAST ONCE with a given configuration, then it is retried on failure with back-off. Only if the given configuration never worked (regardless of failure mode) would the webhook be disabled. I don't know how useful that would be in your SaaS situation. Increasing the back-off period incrementally, up to one week, might be as effective as disabling it at large scale. I know what happens to hobbyist open-source projects... It would work better for self-hosted, though, should we ever reintroduce this feature.
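A minimal sketch of Proposal B under these assumptions; the fields and method names are hypothetical, not GitLab's schema:

```ruby
ProposalBHook = Struct.new(:ever_worked, :recent_failures, :disabled) do
  def record_success
    self.ever_worked = true
    self.recent_failures = 0
  end

  def record_failure
    self.recent_failures = (recent_failures || 0) + 1

    if ever_worked
      # known-good configuration: treat the error as temporary and retry
      # later with an incrementally longer back-off
    else
      self.disabled = true # configuration never worked: disable outright
    end
  end

  def configuration_changed
    self.ever_worked = false # reset the "working" flag
  end
end
```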
Consider that a 404 may also happen in operation - while we deploy/migrate/initialize. The application/route/ingress may be removed temporarily and k8s, for instance, will give a 404 in that case.
It would also be useful, even when auto-disabling is disabled, for project users to be alerted by that nice fat red banner that their webhooks are failing. I found that useful. Apparently there is a similar feature request, but only for the webhook's admin view - #391184.
Implementation Guide
A detailed implementation guide is written here: #396577 (comment 2113987044).
The basics of the guide are:
In a first MR:
We would remove the `:failed` category from `WebHookService`, and any failure would be categorised as `:error`.
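A rough before/after sketch of that change; the `categorise` method name and the status-code ranges are assumptions (only the `:failed` and `:error` categories come from the guide), and we assume `response.code` is an Integer:

```ruby
# Before: 4xx responses get their own :failed category, which feeds the
# permanent-disable path.
def categorise(response)
  return :ok if (200..299).cover?(response.code)
  return :failed if (400..499).cover?(response.code)

  :error # 5xx, timeouts, network errors
end

# After: anything that is not a success is simply :error.
def categorise(response)
  (200..299).cover?(response.code) ? :ok : :error
end
```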
Then, in `WebHooks::LogExecutionService`, we could remove the handling of `:failed`: since all types of failure would be categorised as `:error`, we would always `#backoff!`.
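A hedged sketch of what the simplified handling could look like; the surrounding method and the `hook.enable!` call are assumptions:

```ruby
def update_hook_failure_state(hook, category)
  case category
  when :ok
    hook.enable!  # a success resets the failure state
  when :error
    hook.backoff! # every failure backs off and can self-heal later
  end
end
```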
We would add a data migration to convert existing permanently disabled webhooks into temporarily disabled ones.
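As a sketch only, since the real migration depends on the actual schema: here we assume permanently disabled hooks are encoded as rows with recorded failures and no `disabled_until` timestamp, which is an assumption rather than the documented schema:

```ruby
class MigratePermanentlyDisabledWebHooks < ActiveRecord::Migration[7.0]
  def up
    # Assumption: giving permanently disabled hooks a disabled_until
    # timestamp makes the disablement temporary, so the backoff cycle
    # can resume and the hooks can self-heal.
    execute(<<~SQL)
      UPDATE web_hooks
      SET disabled_until = NOW() + INTERVAL '1 minute'
      WHERE recent_failures > 3
        AND disabled_until IS NULL
    SQL
  end

  def down
    # No-op: after the migration we can no longer tell which hooks were
    # permanently disabled before it ran.
  end
end
```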
In a second MR:
We would remove the unused code around permanently disabled webhooks.