SIGUSR2: zero downtime restart #4624

daipom · 2024-08-30T07:02:14Z

Which issue(s) this PR fixes:

Fixes Update/Reload without downtime #4622

What this PR does / why we need it:
This replaces the current SIGUSR2 behavior (#2716) with a new feature (not supported on Windows):
- Restart with a new process, with zero downtime.

The primary motivation is to allow Fluentd to be updated without data loss in plugins such as in_udp (see #4622).

Specification:

- 2 ways to trigger this feature (non-Windows):
  - Signal: SIGUSR2 to the supervisor.
    - Sending SIGUSR2 to the workers triggers the traditional GracefulReload.
      - (The traditional way is left in place, just in case.)
  - RPC: /api/processes.zeroDowntimeRestart
    - /api/config.gracefulReload is left for the traditional feature.
- This starts the new supervisor and workers with zero downtime for some plugins.
  - Input plugins that support zero_downtime_restart work in parallel.
    - Supported input plugins:
      - in_tcp
      - in_udp
      - in_syslog
  - The old processes stop after 10s.
- The new supervisor works in source-only mode (Add with-source-only feature #4661) until the old processes stop.
  - After the old processes stop, the data handled by the new processes are loaded and processed.
  - If needed, you can configure source_only_buffer (see Add with-source-only feature #4661).
- Windows: Not affected at all. It keeps the traditional GracefulReload.
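
The two triggers listed above can be sketched in Ruby as follows. This is a minimal sketch, not part of the PR: the helper names are hypothetical, the endpoint 127.0.0.1:24444 assumes that `rpc_endpoint` is configured with that value, and it assumes the new RPC is called with a plain HTTP GET like the existing RPC endpoints.

```ruby
require "net/http"
require "uri"

# Assumption: rpc_endpoint is set to this address in the <system> section.
RPC_ENDPOINT = "127.0.0.1:24444"

# Build the URL of the RPC named in the specification (hypothetical helper).
def zero_downtime_restart_uri(endpoint = RPC_ENDPOINT)
  URI("http://#{endpoint}/api/processes.zeroDowntimeRestart")
end

# Trigger by signal. The PID must belong to the supervisor:
# sending SIGUSR2 to a worker triggers the traditional GracefulReload instead.
def trigger_by_signal(supervisor_pid)
  Process.kill("USR2", supervisor_pid)
end

# Trigger by RPC (assumed to be an HTTP GET, like the existing RPC endpoints).
def trigger_by_rpc(endpoint = RPC_ENDPOINT)
  Net::HTTP.get_response(zero_downtime_restart_uri(endpoint))
end
```

Neither helper applies on Windows, where the traditional GracefulReload remains the only behavior.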

Mechanism

1. The supervisor receives SIGUSR2.
2. Spawn a new supervisor.
3. Take over shared sockets.
4. Launch new workers, and stop old processes in parallel.
   - Launch new workers with source-only mode
     - Limit to zero_downtime_restart_ready? input plugin
   - Send SIGTERM to the old supervisor after a 10s delay from step 3.
5. The old supervisor stops and sends SIGWINCH to the new one.
6. The new workers run fully.
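
Why taking over a shared socket avoids data loss can be illustrated with a toy example. This is only a sketch of the idea in a single process: Fluentd shares sockets between the old and new instances through the server helper, not through `IO#dup`, and the function name here is hypothetical.

```ruby
require "socket"

# Toy illustration: two handles on the same bound UDP socket.
# The "old" handle can be closed while the "new" one keeps receiving on
# the same port, so the port is never unbound during the hand-over.
def handover_demo
  old_sock = UDPSocket.new
  old_sock.bind("127.0.0.1", 0)   # port 0: let the OS pick a free port
  port = old_sock.addr[1]

  new_sock = old_sock.dup         # duplicates the underlying file descriptor

  sender = UDPSocket.new
  sender.send("first", 0, "127.0.0.1", port)
  first, = old_sock.recvfrom(1024)   # received by the "old" handle

  old_sock.close                     # the "old" side stops

  sender.send("second", 0, "127.0.0.1", port)
  second, = new_sock.recvfrom(1024)  # still received, by the "new" handle

  [first, second]
ensure
  [old_sock, new_sock, sender].each { |s| s.close if s && !s.closed? }
end
```

In the real mechanism the two handles live in different processes (old and new workers), which is why only plugins that can work in parallel with another instance are eligible.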

Needs the following:
- Add with-source-only feature #4661
- treasure-data/serverengine#146

Conditions under which zero_downtime_restart_ready? can be enabled:

- The plugin must be able to work in parallel with another Fluentd instance.
- Notes:
  - The sockets provided by the server helper are shared with the new Fluentd instance.
  - Input plugins that manage a position, such as in_tail, should not enable zero_downtime_restart_ready?.
    - Such input plugins do not cause data loss on restart, so there is no need to enable this in the first place.
  - in_http and in_forward could also be supported. They are not supported this time simply because considering them takes more time.
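
How a plugin opts in can be sketched as below. The classes are stand-ins written for illustration, not Fluentd's real plugin classes; the sketch assumes that zero_downtime_restart_ready? is an instance predicate that defaults to false and that supported plugins override it.

```ruby
# Hypothetical stand-in for an input plugin base class.
class SketchInput
  # Default: not safe to run in parallel with another Fluentd instance.
  def zero_downtime_restart_ready?
    false
  end
end

# Server-based input: stopping it would drop data, and its socket can be
# shared with the new instance, so it opts in.
class SketchUdpInput < SketchInput
  def zero_downtime_restart_ready?
    true
  end
end

# Position-managing input (like in_tail): no data loss on restart,
# so it keeps the default and does not opt in.
class SketchTailInput < SketchInput
end

# Select the inputs that would be launched in source-only mode
# (hypothetical helper mirroring "Limit to zero_downtime_restart_ready? input plugin").
def source_only_inputs(inputs)
  inputs.select(&:zero_downtime_restart_ready?)
end
```

With one input of each kind, only the UDP stand-in would be selected to run during the overlap period.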

The appropriateness of replacing the traditional SIGUSR2

There are the following reasons:

- The traditional SIGUSR2 feature has some limitations and issues.
  - Limitations:
    1. A change to system_config is ignored because applying it requires restarting (kill/spawn) the process.
    2. Plugins must not use class variables when restarting.
  - Issues:
    - #2259
    - #3469
    - #3549
- This new feature allows restarts without downtime and without such limitations.
  - Although supported plugins are limited, that is not a problem for many plugins.
    (The problem lies with server-based input plugins, where stopping results in data loss.)
- This new feature has the big advantage that it can also be used to update Fluentd.
  - In the future, fluent-package will use this feature to allow updates with zero downtime by default.
    - Update/Reload without downtime fluent-package-builder#713
- If needed, we can still use the traditional feature via RPC or by sending SIGUSR2 directly to the workers.

ETC

Docs Changes:
TODO

Release Note:

Add zero-downtime-restart feature for non-Windows (USR2 signal and /api/processes.zeroDowntimeRestart RPC API)

TODO:

- Some implementation TODOs referred to in code comments
- Tests
- Document

daipom · 2024-10-11T01:06:02Z

The basic implementation is done.
Some concepts from #4654 are reflected. Thanks @Watson1978!

daipom · 2024-11-27T02:55:57Z

Thanks for your review!

kenhys · 2024-11-27T02:59:20Z

During zeroDowntimeRestart, other HTTP endpoints are left in a non-guarded state. Is it intentional?

daipom · 2024-11-27T03:33:48Z

> During zeroDowntimeRestart, other HTTP endpoints are left in a non-guarded state. Is it intentional?

Yes.
The old Fluentd should continue to work as is until it receives SIGTERM in step 4
(even if the new Fluentd does not work as expected).

The new Fluentd RPC starts in step 5, so there is no conflict.

If the old Fluentd receives /api/processes.killWorkers, it just causes a quick transition to step 5.

This replaces the current `SIGUSR2` (#2716) with the new feature.
(Not supported on Windows).

* Restart the new process with zero downtime

The primary motivation is to enable the update of Fluentd without data loss of plugins such as `in_udp`.

Specification:

* 2 ways to trigger this feature (non-Windows):
  * Signal: `SIGUSR2` to the supervisor.
    * Sending `SIGUSR2` to the workers triggers the traditional GracefulReload.
      * (Leave the traditional way, just in case)
  * RPC: `/api/processes.zeroDowntimeRestart`
    * Leave `/api/config.gracefulReload` for the traditional feature.
* This starts the new supervisor and workers with zero downtime for some plugins.
  * Input plugins with `zero_downtime_restart` supported work in parallel.
    * Supported input plugins:
      * `in_tcp`
      * `in_udp`
      * `in_syslog`
  * The old processes stop after 10s.
* The new supervisor works in `source-only` mode (#4661) until the old processes stop.
  * After the old processes stop, the data handled by the new processes are loaded and processed.
  * If need, you can configure `source_only_buffer` (see #4661).
* Windows: Not affected at all. Remains the traditional GracefulReload.

Mechanism:

1. The supervisor receives SIGUSR2.
2. Spawn a new supervisor.
3. Take over shared sockets.
4. Launch new workers, and stop old processes in parallel.
   * Launch new workers with source-only mode
     * Limit to zero_downtime_restart_ready? input plugin
   * Send SIGTERM to the old supervisor after 10s delay from 3.
5. The old supervisor stops and sends SIGWINCH to the new one.
6. The new workers run fully.

Note: need these feature

* #4661
* treasure-data/serverengine#146

Conditions under which `zero_downtime_restart_ready?` can be enabled:

* Must be able to work in parallel with another Fluentd instance.
* Notes:
  * The sockets provided by server helper are shared with the new Fluentd instance.
  * Input plugins managing a position such as `in_tail` should not enable its `zero_downtime_restart_ready?`.
    * Such input plugins do not cause data loss on restart, so there is no need to enable this in the first place.
  * `in_http` and `in_forward` could also be supported. Not supporting them this time is simply a matter of time to consider.

The appropriateness of replacing the traditional SIGUSR2:

* The traditional SIGUSR2 feature has some limitations and issues.
  * Limitations:
    1. A change to system_config is ignored because it needs to restart(kill/spawn) process.
    2. All plugins must not use class variable when restarting.
  * Issues:
    * #2259
    * #3469
    * #3549
* This new feature allows restarts without downtime and such limitations.
  * Although supported plugins are limited, that is not a problem for many plugins.
    (The problem is with server-based input plugins where the stop results in data loss).
* This new feature has a big advantage that it can also be used to update Fluentd.
  * In the future, fluent-package will use this feature to allow update with zero downtime by default.
* If needed, we can still use the traditional feature by RPC or directly sending `SIGUSR2` to the workers.

Co-authored-by: Shizuo Fujita
Co-authored-by: Kentaro Hayashi
Signed-off-by: Daijiro Fukuda

kenhys

LGTM.

daipom · 2024-11-28T04:48:43Z

Thanks for your review!

daipom mentioned this pull request Aug 30, 2024

Update/Reload without downtime #4622

Closed

daipom changed the title ~~Restart without downtime~~ Update/Restart without downtime Aug 30, 2024

daipom changed the title ~~Update/Restart without downtime~~ Update/Reload without downtime Aug 30, 2024

daipom self-assigned this Aug 30, 2024

daipom force-pushed the restart-without-downtime branch from 41fd042 to d0f31e8 Compare October 3, 2024 01:58

Watson1978 mentioned this pull request Oct 3, 2024

[PoC] Update/Reload without downtime #4654

Closed

daipom force-pushed the restart-without-downtime branch from d0f31e8 to 630f809 Compare October 11, 2024 00:59

daipom force-pushed the restart-without-downtime branch from 630f809 to 1cbbd9a Compare October 11, 2024 01:12

daipom mentioned this pull request Oct 11, 2024

[PoC] Update/Reload without downtime treasure-data/serverengine#149

Closed

daipom added this to the v1.18.0 milestone Oct 11, 2024

daipom mentioned this pull request Oct 30, 2024

use Fluentd for the feature fluent/fluent-package-builder#700

Merged

daipom force-pushed the restart-without-downtime branch from 1cbbd9a to f8755d0 Compare October 31, 2024 03:06

daipom added a commit to fluent/fluent-package-builder that referenced this pull request Oct 31, 2024

use Fluentd for the feature (#700)

a1bc198

kenhys pushed a commit to fluent/fluent-package-builder that referenced this pull request Nov 5, 2024

use Fluentd for the feature (#700)

953caa3

daipom mentioned this pull request Nov 5, 2024

Update/Reload without downtime fluent/fluent-package-builder#713

Draft

Watson1978 reviewed Nov 14, 2024

lib/fluent/plugin/in_syslog.rb (outdated, resolved)

Watson1978 reviewed Nov 14, 2024

lib/fluent/root_agent.rb (outdated, resolved)

kenhys pushed a commit to fluent/fluent-package-builder that referenced this pull request Nov 19, 2024

use Fluentd for the feature (#700)

0642709

daipom force-pushed the restart-without-downtime branch 10 times, most recently from 1a7f2e0 to feda2ea Compare November 25, 2024 10:12

daipom changed the title ~~Update/Reload without downtime~~ GracefulReload(SIGUSR2): Restart new process with zero downtime Nov 25, 2024

daipom force-pushed the restart-without-downtime branch 2 times, most recently from 9a2e188 to 873bf29 Compare November 25, 2024 16:08

daipom marked this pull request as ready for review November 25, 2024 16:13

daipom requested a review from Watson1978 November 26, 2024 01:48

daipom changed the title ~~GracefulReload(SIGUSR2): Restart new process with zero downtime~~ SIGUSR2: Restart new process with zero downtime Nov 26, 2024

daipom force-pushed the restart-without-downtime branch 2 times, most recently from 049b9f7 to c714d9c Compare November 26, 2024 03:14

daipom changed the title ~~SIGUSR2: Restart new process with zero downtime~~ SIGUSR2: zero downtime restart Nov 26, 2024

daipom requested a review from kenhys November 26, 2024 03:26

Watson1978 reviewed Nov 26, 2024

lib/fluent/supervisor.rb (outdated, resolved)

Watson1978 reviewed Nov 26, 2024

lib/fluent/supervisor.rb (outdated, resolved)

daipom force-pushed the restart-without-downtime branch 4 times, most recently from 1b52de8 to d7e68db Compare November 27, 2024 02:53

Watson1978 approved these changes Nov 27, 2024

kenhys requested changes Nov 27, 2024

lib/fluent/root_agent.rb (outdated, resolved)

lib/fluent/root_agent.rb (outdated, resolved)

lib/fluent/root_agent.rb (outdated, resolved)

lib/fluent/root_agent.rb (outdated, resolved)

daipom force-pushed the restart-without-downtime branch from d7e68db to e52d3bd Compare November 27, 2024 03:12

daipom force-pushed the restart-without-downtime branch from e52d3bd to 8e09c09 Compare November 27, 2024 03:53

daipom force-pushed the restart-without-downtime branch from 8e09c09 to d7164dd Compare November 27, 2024 04:36

kenhys approved these changes Nov 28, 2024

daipom merged commit d102527 into master Nov 28, 2024
17 of 18 checks passed

daipom deleted the restart-without-downtime branch November 28, 2024 04:48

daipom added a commit to daipom/fluent-package-builder that referenced this pull request Nov 29, 2024

use Fluentd for the feature (fluent#700)

1daab78
