fix(certgen): bundle previous CA during cert rotation to prevent mTLS disruption #8534
OliverBailey wants to merge 2 commits into envoyproxy:main from
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #8534 +/- ##
==========================================
+ Coverage 74.14% 74.15% +0.01%
==========================================
Files 242 242
Lines 37749 37784 +35
==========================================
+ Hits 27989 28020 +31
- Misses 7806 7808 +2
- Partials 1954 1956 +2
|
… disruption

When certgen --overwrite rotates certificates, the ca.crt field of each control-plane Secret was replaced atomically with the new CA. Kubernetes propagates Secret updates to pods via the kubelet volume sync loop, and Envoy reloads its xDS TLS context via SDS: neither is instantaneous. During the convergence window, pods that have picked up a new leaf cert (signed by the new CA) are rejected by peers that still hold only the old CA in their trust store, causing mTLS authentication failures. This is the backwards-incompatible rotation problem described in envoyproxy#4891 and reproduced on v1.6.1 by users in that thread.

Fix: when updating an existing Secret that already contains a ca.crt, bundle the outgoing CA together with the incoming CA so that every component trusts both during the transition. Concretely, CreateOrUpdateSecrets now calls bundleCACerts(newCA, oldCA), which:

1. Starts the bundle with all certs from newCA (the freshly generated CA).
2. Appends the first non-expired, non-duplicate cert from oldCA (the CA that was active at the previous rotation).
3. Skips any further certs from oldCA.

The cap of one carry-over cert keeps the bundle at a maximum of two entries regardless of how frequently rotations occur. The reasoning is: by the time an operator runs certgen --overwrite a second time, all components (kubelet sync period + SDS reload) will have converged on the certs written during the first rotation. The CA from two rotations ago is therefore never needed, and carrying it forward indefinitely would cause unbounded bundle growth for long-lived CAs (e.g. the default 5-year lifetime). The single carry-over is dropped automatically at the rotation after it would have been needed.

The HMAC secret (envoy-oidc-hmac) carries no ca.crt and is unaffected.

Fixes envoyproxy#4891 (partial; Rate Limit CA hot-reload addressed separately)

Signed-off-by: Oliver Bailey <github@obailey.co.uk>
|
|
Thanks for the questions; good ones @arkodg

Does rotation impact existing connections?
No. TLS/mTLS verification only happens at the handshake. An established connection that completed its handshake before rotation doesn't get re-verified and won't be disrupted. For the Envoy ↔ Envoy Gateway xDS gRPC stream this is a single long-lived connection per Envoy pod, so rotation alone won't break anything in flight.

During the rotation window, won't the slower peer be unable to verify a newer cert?
You're right. The bundle is a targeted improvement, not a complete solution.

What

What it doesn't cover: if the already-updated pod is presenting its new leaf cert (signed by

The mitigating factors are:
A fully race-free approach would require a two-phase rotation: push only the updated |
|
Hey, trying to understand the permanent failure case: which client was unable to set up a connection during the rotation change? Is this envoy-ratelimit? Does the ratelimit service use updated cert material if its ConfigMap and Secrets have been updated?
|
cc @rajatvig |
|
Yes, this might help fix the case when using the Certgen Job construct vs the custom Certificate setup. The core of the problem is that when the old CA is renewed but the leaf certs are not, the bundle needs to carry both CA certs so that the existing leaf certs can still be validated. The theory is that once the new CA is refreshed, this would work. The other idea I have been trying to wrangle is watching the CA secret for changes and triggering leaf certificate renewals when that happens.
Envoy is the client that can't establish the connection — Rate Limit (the server) rejects it because it doesn't hot-reload its CA cert. The three components behave differently after the kubelet syncs the updated Secret volume:
The failure sequence is: kubelet syncs → Envoy SDS-reloads its new leaf cert (signed by the new CA) → Rate Limit still holds the old CA in memory → mTLS handshakes from Envoy to Rate Limit are rejected. The only recovery without intervention is a pod restart. This PR addresses the Envoy ↔ EG convergence window via the CA bundle. The Rate Limit side is handled in a follow-up PR ( |
Exactly right — the bundle is specifically for the convergence window where the new CA is in place but not all pods have synced yet, so old leaf certs still need to be verifiable. The CA-watch → leaf renewal idea is a nice proactive complement to this. Rather than tolerating the mismatch window it would eliminate it entirely. It's a bigger change though — needs a controller watching the CA Secret and re-invoking cert generation. Worth tracking as a follow-up issue; happy to open one once these two PRs land. |
|
@OliverBailey can we fix the issue in ratelimit repo to pick up the newer certs when the files change ? |
|
@arkodg Yeah, that works for me. Do you want me to make that change, or would you rather pick it up yourselves?
|
Would be great if you can drive that change in the rate limit repo |
Summary
Fixes #4891 (partial — Rate Limit CA hot-reload addressed in a follow-up PR)
Problem
When certgen --overwrite rotates certificates, ca.crt in each control-plane Secret is replaced atomically with the new CA. Two propagation mechanisms are at play after that write:
- the kubelet volume sync loop, which propagates the Secret update to pods;
- Envoy's SDS reload of its xDS TLS context.
Neither is synchronous. During the convergence window a pod that has picked up a new leaf cert (signed by the new CA) is rejected by a peer that still holds only the old CA in its trust store, causing mTLS authentication failures. This is precisely the incident reproduced on v1.6.1 and described in the issue thread.
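The trust-store mismatch can be reproduced in isolation with Go's standard crypto/x509 package. The sketch below is illustrative only and is not taken from Envoy Gateway: the authority type and the newAuthority, issueLeaf, and trusts helpers are hypothetical names, and real peers verify during a TLS handshake rather than through a direct Verify call.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// authority (hypothetical type) pairs a CA certificate with its signing key.
type authority struct {
	cert *x509.Certificate
	key  *ecdsa.PrivateKey
}

// template builds a short-lived certificate template for a CA or a leaf.
func template(cn string, serial int64, isCA bool) *x509.Certificate {
	t := &x509.Certificate{
		SerialNumber:          big.NewInt(serial),
		Subject:               pkix.Name{CommonName: cn},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(24 * time.Hour),
		BasicConstraintsValid: true,
		IsCA:                  isCA,
		KeyUsage:              x509.KeyUsageDigitalSignature,
	}
	if isCA {
		t.KeyUsage |= x509.KeyUsageCertSign
	} else {
		t.ExtKeyUsage = []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth, x509.ExtKeyUsageServerAuth}
	}
	return t
}

// sign creates and parses a certificate, panicking on error to keep the demo short.
func sign(tmpl, parent *x509.Certificate, pub *ecdsa.PublicKey, signer *ecdsa.PrivateKey) *x509.Certificate {
	der, err := x509.CreateCertificate(rand.Reader, tmpl, parent, pub, signer)
	if err != nil {
		panic(err)
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		panic(err)
	}
	return cert
}

func newKey() *ecdsa.PrivateKey {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	return key
}

// newAuthority mints a self-signed CA.
func newAuthority(cn string) authority {
	key, tmpl := newKey(), template(cn, 1, true)
	return authority{cert: sign(tmpl, tmpl, &key.PublicKey, key), key: key}
}

// issueLeaf signs a leaf certificate with this CA.
func (a authority) issueLeaf(cn string) *x509.Certificate {
	key := newKey()
	return sign(template(cn, 2, false), a.cert, &key.PublicKey, a.key)
}

// trusts reports whether leaf verifies against a trust store holding exactly cas.
func trusts(leaf *x509.Certificate, cas ...*x509.Certificate) bool {
	pool := x509.NewCertPool()
	for _, ca := range cas {
		pool.AddCert(ca)
	}
	_, err := leaf.Verify(x509.VerifyOptions{
		Roots:     pool,
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageAny},
	})
	return err == nil
}

func main() {
	oldCA, newCA := newAuthority("old-ca"), newAuthority("new-ca")
	oldLeaf, newLeaf := oldCA.issueLeaf("not-yet-synced"), newCA.issueLeaf("already-synced")

	fmt.Println(trusts(newLeaf, oldCA.cert))             // false: peer still holds only the old CA
	fmt.Println(trusts(oldLeaf, newCA.cert))             // false: peer holds only the new CA
	fmt.Println(trusts(oldLeaf, newCA.cert, oldCA.cert)) // true: both CAs in the trust store
	fmt.Println(trusts(newLeaf, newCA.cert, oldCA.cert)) // true
}
```

The first line of output is the rejection described above; the last two show that a trust store containing both CAs accepts leaf certs from either generation.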
Fix
When updating an existing Secret that already contains a ca.crt, bundle the outgoing CA together with the incoming CA so that every component trusts both during the transition.
bundleCACerts(newCA, oldCA):
1. Starts the bundle with all certs from newCA (the freshly generated CA).
2. Appends the first non-expired, non-duplicate cert from oldCA, the CA active at the previous rotation.
3. Skips any further certs from oldCA (break).
Why a maximum of two CAs
Carrying forward only one previous CA keeps the bundle at no more than two entries regardless of rotation frequency.
By the time an operator runs certgen --overwrite a second time, all components will have converged on the certs written during the first rotation (kubelet sync + SDS reload happen within seconds to a minute). The CA from two rotations ago is therefore never needed in practice. Carrying it forward indefinitely would cause unbounded bundle growth for long-lived CAs; the default lifetime is 5 years. The single carry-over is naturally dropped at the rotation after it would have been needed.
Scope
- The envoy-oidc-hmac Secret carries no ca.crt and is unaffected.
- Rate Limit CA hot-reload is addressed in a follow-up PR (fix/ratelimit-ca-restart → this branch). Rate Limit does not watch its CA file for changes; that PR triggers a rolling restart of the Rate Limit Deployment after rotation.
Testing
Added TestCreateOrUpdateSecretsBundlesCA and TestBundleCACerts covering: