SIGUSR2: zero downtime restart #4624
The basic implementation is done.
Use this to test this feature.

* fluent/fluentd#4624

Signed-off-by: Daijiro Fukuda <[email protected]>
Thanks for your review!
During zeroDowntimeRestart, other HTTP endpoints are left in a non-guarded state. Is that intentional?
Yes. The new Fluentd RPC starts at If the old Fluentd receives |
This replaces the current `SIGUSR2` (#2716) with the new feature.
(Not supported on Windows).

* Restart the new process with zero downtime

The primary motivation is to enable the update of Fluentd without data loss of plugins such as `in_udp`.

Specification:

* 2 ways to trigger this feature (non-Windows):
  * Signal: `SIGUSR2` to the supervisor.
    * Sending `SIGUSR2` to the workers triggers the traditional GracefulReload.
      * (Leave the traditional way, just in case)
  * RPC: `/api/processes.zeroDowntimeRestart`
    * Leave `/api/config.gracefulReload` for the traditional feature.
* This starts the new supervisor and workers with zero downtime for some plugins.
  * Input plugins with `zero_downtime_restart` supported work in parallel.
    * Supported input plugins:
      * `in_tcp`
      * `in_udp`
      * `in_syslog`
  * The old processes stop after 10s.
  * The new supervisor works in `source-only` mode (#4661) until the old processes stop.
  * After the old processes stop, the data handled by the new processes are loaded and processed.
  * If need, you can configure `source_only_buffer` (see #4661).
* Windows: Not affected at all. Remains the traditional GracefulReload.

Mechanism:

1. The supervisor receives SIGUSR2.
2. Spawn a new supervisor.
3. Take over shared sockets.
4. Launch new workers, and stop old processes in parallel.
   * Launch new workers with source-only mode
     * Limit to zero_downtime_restart_ready? input plugin
   * Send SIGTERM to the old supervisor after 10s delay from 3.
5. The old supervisor stops and sends SIGWINCH to the new one.
6. The new workers run fully.

Note: need these feature

* #4661
* treasure-data/serverengine#146

Conditions under which `zero_downtime_restart_ready?` can be enabled:

* Must be able to work in parallel with another Fluentd instance.
* Notes:
  * The sockets provided by server helper are shared with the new Fluentd instance.
  * Input plugins managing a position such as `in_tail` should not enable its `zero_downtime_restart_ready?`.
    * Such input plugins do not cause data loss on restart, so there is no need to enable this in the first place.
  * `in_http` and `in_forward` could also be supported. Not supporting them this time is simply a matter of time to consider.

The appropriateness of replacing the traditional SIGUSR2:

* The traditional SIGUSR2 feature has some limitations and issues.
  * Limitations:
    1. A change to system_config is ignored because it needs to restart(kill/spawn) process.
    2. All plugins must not use class variable when restarting.
  * Issues:
    * #2259
    * #3469
    * #3549
* This new feature allows restarts without downtime and such limitations.
  * Although supported plugins are limited, that is not a problem for many plugins. (The problem is with server-based input plugins where the stop results in data loss).
* This new feature has a big advantage that it can also be used to update Fluentd.
  * In the future, fluent-package will use this feature to allow update with zero downtime by default.
* If needed, we can still use the traditional feature by RPC or directly sending `SIGUSR2` to the workers.

Co-authored-by: Shizuo Fujita <[email protected]>
Co-authored-by: Kentaro Hayashi <[email protected]>
Signed-off-by: Daijiro Fukuda <[email protected]>
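The "Limit to zero_downtime_restart_ready? input plugin" step in the mechanism can be pictured with a small Ruby sketch. The classes and the `inputs_to_start` helper below are hypothetical stand-ins written for illustration only; they are not Fluentd's actual classes or internals. The only name taken from this PR is the predicate `zero_downtime_restart_ready?`, and the sketch assumes that inputs opt out by default.

```ruby
# Hypothetical stand-ins for illustration; NOT Fluentd's real plugin classes.
class SketchInput
  # Assumption: inputs are not ready for zero-downtime restart unless they opt in.
  def zero_downtime_restart_ready?
    false
  end
end

# A socket-based input that can work in parallel with the old instance.
class SketchUdpInput < SketchInput
  def zero_downtime_restart_ready?
    true
  end
end

# An input that manages a position (like in_tail) keeps the default.
class SketchTailInput < SketchInput
end

# While the old processes are still running (source-only mode),
# start only the inputs that declare themselves ready.
def inputs_to_start(inputs, source_only:)
  return inputs unless source_only

  inputs.select(&:zero_downtime_restart_ready?)
end

inputs = [SketchUdpInput.new, SketchTailInput.new]
started = inputs_to_start(inputs, source_only: true)
puts started.map { |input| input.class.name }.inspect
```

In this sketch, the position-managing input is simply skipped during the takeover window and starts once the new workers run fully, which matches the note that such plugins do not lose data on restart anyway.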
LGTM.
Thanks for your review!
Which issue(s) this PR fixes:
What this PR does / why we need it:
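This replaces the current `SIGUSR2` (#2716) with the new feature. (Not supported on Windows).

* Restart the new process with zero downtime

The primary motivation is to enable the update of Fluentd without data loss of plugins such as `in_udp`. (Please see #4622).

Specification:

* 2 ways to trigger this feature (non-Windows):
  * Signal: `SIGUSR2` to the supervisor.
    * Sending `SIGUSR2` to the workers triggers the traditional GracefulReload.
  * RPC: `/api/processes.zeroDowntimeRestart`
    * Leave `/api/config.gracefulReload` for the traditional feature.
* This starts the new supervisor and workers with zero downtime for some plugins.
  * Input plugins with `zero_downtime_restart` supported work in parallel.
    * Supported input plugins:
      * `in_tcp`
      * `in_udp`
      * `in_syslog`
  * The old processes stop after 10s.
  * The new supervisor works in `source-only` mode (Add with-source-only feature #4661) until the old processes stop.
  * After the old processes stop, the data handled by the new processes are loaded and processed.
  * If needed, you can configure `source_only_buffer` (see Add with-source-only feature #4661).
* Windows: Not affected at all. Remains the traditional GracefulReload.

Mechanism

1. The supervisor receives SIGUSR2.
2. Spawn a new supervisor.
3. Take over shared sockets.
4. Launch new workers, and stop old processes in parallel.
   * Launch new workers with source-only mode
     * Limit to `zero_downtime_restart_ready?` input plugin
   * Send SIGTERM to the old supervisor after 10s delay from 3.
5. The old supervisor stops and sends SIGWINCH to the new one.
6. The new workers run fully.

Needs following:

* #4661
* treasure-data/serverengine#146

Conditions under which `zero_downtime_restart_ready?` can be enabled:

* Must be able to work in parallel with another Fluentd instance.
* Notes:
  * The sockets provided by server helper are shared with the new Fluentd instance.
  * Input plugins managing a position such as `in_tail` should not enable its `zero_downtime_restart_ready?`.
    * Such input plugins do not cause data loss on restart, so there is no need to enable this in the first place.
  * `in_http` and `in_forward` could also be supported. Not supporting them this time is simply a matter of time to consider.

The appropriateness of replacing the traditional SIGUSR2

There are the following reasons:

* The traditional SIGUSR2 feature has some limitations and issues.
  * Limitations:
    1. A change to system_config is ignored because it needs to restart (kill/spawn) process.
    2. All plugins must not use class variable when restarting.
  * Issues:
    * #2259
    * #3469
    * #3549
* This new feature allows restarts without downtime and such limitations.
  * Although supported plugins are limited, that is not a problem for many plugins. (The problem is with server-based input plugins where the stop results in data loss).
* This new feature has a big advantage that it can also be used to update Fluentd.
  * In the future, fluent-package will use this feature to allow update with zero downtime by default.
* If needed, we can still use the traditional feature by RPC or directly sending `SIGUSR2` to the workers.

ETC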
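For reference, the two triggers named in the specification (the `SIGUSR2` signal to the supervisor and the `/api/processes.zeroDowntimeRestart` RPC) could be exercised from Ruby roughly as follows. This is a minimal sketch under stated assumptions: the helper names are made up for illustration, the supervisor PID is assumed to be known to the caller, and `127.0.0.1:24444` is only an example value for the `rpc_endpoint` system setting, which must be enabled in the running Fluentd for the RPC route to work.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: send SIGUSR2 to the supervisor process.
# Assumption: the caller already knows the supervisor PID (e.g. from a PID file).
def trigger_by_signal(supervisor_pid)
  Process.kill("USR2", supervisor_pid)
end

# Hypothetical helper: build the RPC URL from a "host:port" endpoint string.
def zero_downtime_restart_uri(rpc_endpoint)
  URI("http://#{rpc_endpoint}/api/processes.zeroDowntimeRestart")
end

# Hypothetical helper: call the RPC with an HTTP GET and return the response.
def trigger_by_rpc(rpc_endpoint)
  Net::HTTP.get_response(zero_downtime_restart_uri(rpc_endpoint))
end

# Example value only; use the rpc_endpoint configured for your instance.
uri = zero_downtime_restart_uri("127.0.0.1:24444")
puts uri
```

Neither trigger is invoked when the snippet is loaded; call `trigger_by_signal` or `trigger_by_rpc` explicitly against a non-Windows Fluentd that includes this feature.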
Docs Changes:
TODO
Release Note:
(`USR2` signal and `/api/processes.zeroDowntimeRestart` RPC API)
TODO: