Ephemeral Cluster Controller: Terminate ci-operator gracefully#5126
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
WalkthroughReconciler split control-plane and per-cluster clients (masterClient + buildClients). When an EphemeralCluster is missing, reconciler lists ci-operator namespaces by job label and creates (or preserves) a deprovisioning signal secret in the test namespace instead of immediately aborting the ProwJob. Tests and imports updated to Changes
Sequence Diagram(s)sequenceDiagram
participant Reconciler as Reconciler (PJ)
participant Master as MasterClient
participant NSAPI as Namespace API
participant Build as BuildClient
participant SecretAPI as Secret API
participant Prov as Provisioning
Reconciler->>Master: Get EphemeralCluster(ref)
Master-->>Reconciler: NotFound
Reconciler->>NSAPI: List namespaces by steps.LabelJobID
NSAPI-->>Reconciler: ci-operator namespace (or none)
alt namespace found
Reconciler->>Build: buildClientFor(PJ.Spec.Cluster)
Build-->>Reconciler: client or error
Reconciler->>SecretAPI: Get Secret(EphemeralClusterTestDoneSignalSecretName)
SecretAPI-->>Reconciler: NotFound / Exists
alt secret missing
Reconciler->>SecretAPI: Create Secret (test-done signal)
SecretAPI-->>Reconciler: Created
SecretAPI->>Prov: Provisioning sees secret -> deprovision
else secret exists
Reconciler->>Reconciler: Leave secret in place
end
else namespace not found
Reconciler->>Master: Abort/Update ProwJob (set CompletionTime)
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 11 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: danilo-gemoli The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/controller/ephemeralcluster/reconciler.go (1)
599-617:⚠️ Potential issue | 🔴 CriticalEC deletion still takes the old hard-abort path.
Line 617 still aborts any non-terminal ProwJob immediately. Because the finalizer keeps the
EphemeralClusteraround until this path finishes,pkg/controller/ephemeralcluster/prowjobreconciler.gonever sees the EC as missing during normal deletion, so the new graceful-termination flow is bypassed and ci-operator can still be killed mid-provisioning. Please route this branch through the same “signal deprovision if the ci-operator namespace exists, otherwise abort” logic instead of unconditionally callingabortProwJob.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/ephemeralcluster/reconciler.go` around lines 599 - 617, The current branch unconditionally calls r.abortProwJob(&pj, AbortProwJobDeleteEC) for any non-terminal prowv1.ProwJob state, which forces the hard-abort path; instead, replicate the logic used in pkg/controller/ephemeralcluster/prowjobreconciler.go: check whether the ci-operator namespace for this EphemeralCluster exists (the same existence check used in prowjobreconciler), and if the namespace exists signal deprovision/graceful-termination (the deprovision signaling path), otherwise proceed to call r.abortProwJob to hard-abort; update the switch/default branch in reconciler.go to route through that conditional rather than unconditionally calling r.abortProwJob (keep removeFinalizer and the log.WithField("prowjob_name", pj.Name) unchanged).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/controller/ephemeralcluster/prowjobreconciler.go`:
- Around line 128-145: The secret creation has a TOCTOU race: keep the
Get/Create flow idempotent by treating AlreadyExists as success or by using a
create-or-update helper; specifically, in prowjobreconciler.go where you call
buildClient.Get(...) and then buildClient.Create(...) for
api.EphemeralClusterTestDoneSignalSecretName, catch
apierrors.IsAlreadyExists(err) from Create and treat it as non-fatal (log.Debug
and continue) or replace the Create with controllerutil.CreateOrUpdate (or
client.Patch/Update) to ensure idempotency; apply the same change in
notifyTestComplete in reconciler.go so both places accept AlreadyExists as
success or use the same create-or-update helper.
---
Outside diff comments:
In `@pkg/controller/ephemeralcluster/reconciler.go`:
- Around line 599-617: The current branch unconditionally calls
r.abortProwJob(&pj, AbortProwJobDeleteEC) for any non-terminal prowv1.ProwJob
state, which forces the hard-abort path; instead, replicate the logic used in
pkg/controller/ephemeralcluster/prowjobreconciler.go: check whether the
ci-operator namespace for this EphemeralCluster exists (the same existence check
used in prowjobreconciler), and if the namespace exists signal
deprovision/graceful-termination (the deprovision signaling path), otherwise
proceed to call r.abortProwJob to hard-abort; update the switch/default branch
in reconciler.go to route through that conditional rather than unconditionally
calling r.abortProwJob (keep removeFinalizer and the
log.WithField("prowjob_name", pj.Name) unchanged).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 57a3c51d-90ef-45d0-a65e-77b0ba61ba1f
📒 Files selected for processing (4)
pkg/controller/ephemeralcluster/prowjobreconciler.gopkg/controller/ephemeralcluster/prowjobreconciler_test.gopkg/controller/ephemeralcluster/reconciler.gopkg/controller/ephemeralcluster/reconciler_test.go
| createSecret := false | ||
| if err := buildClient.Get(ctx, types.NamespacedName{Name: api.EphemeralClusterTestDoneSignalSecretName, Namespace: ns}, &corev1.Secret{}); err != nil { | ||
| if apierrors.IsNotFound(err) { | ||
| createSecret = true | ||
| } else { | ||
| log.WithError(err).Warn("Failed to fetch the secret") | ||
| return reconcile.Result{}, err | ||
| } | ||
| } | ||
|
|
||
| log = log.WithField("secret", api.EphemeralClusterTestDoneSignalSecretName) | ||
| if createSecret { | ||
| if err := buildClient.Create(ctx, &corev1.Secret{ObjectMeta: metav1.ObjectMeta{ | ||
| Name: api.EphemeralClusterTestDoneSignalSecretName, | ||
| Namespace: ns, | ||
| }}); err != nil { | ||
| log.WithError(err).Warn("Failed to create the secret") | ||
| return reconcile.Result{}, err |
There was a problem hiding this comment.
Make the signal-secret creation idempotent.
This Get/Create sequence has a TOCTOU window: if test-done-signal is created after Line 129 but before Line 140, Create returns AlreadyExists and this reconcile fails even though the desired state is already satisfied. Please treat apierrors.IsAlreadyExists(err) as success, or switch to a create-or-update helper. The same fix should be shared with notifyTestComplete in pkg/controller/ephemeralcluster/reconciler.go.
Possible fix
if createSecret {
if err := buildClient.Create(ctx, &corev1.Secret{ObjectMeta: metav1.ObjectMeta{
Name: api.EphemeralClusterTestDoneSignalSecretName,
Namespace: ns,
}}); err != nil {
- log.WithError(err).Warn("Failed to create the secret")
- return reconcile.Result{}, err
+ if !apierrors.IsAlreadyExists(err) {
+ log.WithError(err).Warn("Failed to create the secret")
+ return reconcile.Result{}, err
+ }
}
log.Info("Secret created")
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| createSecret := false | |
| if err := buildClient.Get(ctx, types.NamespacedName{Name: api.EphemeralClusterTestDoneSignalSecretName, Namespace: ns}, &corev1.Secret{}); err != nil { | |
| if apierrors.IsNotFound(err) { | |
| createSecret = true | |
| } else { | |
| log.WithError(err).Warn("Failed to fetch the secret") | |
| return reconcile.Result{}, err | |
| } | |
| } | |
| log = log.WithField("secret", api.EphemeralClusterTestDoneSignalSecretName) | |
| if createSecret { | |
| if err := buildClient.Create(ctx, &corev1.Secret{ObjectMeta: metav1.ObjectMeta{ | |
| Name: api.EphemeralClusterTestDoneSignalSecretName, | |
| Namespace: ns, | |
| }}); err != nil { | |
| log.WithError(err).Warn("Failed to create the secret") | |
| return reconcile.Result{}, err | |
| createSecret := false | |
| if err := buildClient.Get(ctx, types.NamespacedName{Name: api.EphemeralClusterTestDoneSignalSecretName, Namespace: ns}, &corev1.Secret{}); err != nil { | |
| if apierrors.IsNotFound(err) { | |
| createSecret = true | |
| } else { | |
| log.WithError(err).Warn("Failed to fetch the secret") | |
| return reconcile.Result{}, err | |
| } | |
| } | |
| log = log.WithField("secret", api.EphemeralClusterTestDoneSignalSecretName) | |
| if createSecret { | |
| if err := buildClient.Create(ctx, &corev1.Secret{ObjectMeta: metav1.ObjectMeta{ | |
| Name: api.EphemeralClusterTestDoneSignalSecretName, | |
| Namespace: ns, | |
| }}); err != nil { | |
| if !apierrors.IsAlreadyExists(err) { | |
| log.WithError(err).Warn("Failed to create the secret") | |
| return reconcile.Result{}, err | |
| } | |
| } | |
| log.Info("Secret created") | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/controller/ephemeralcluster/prowjobreconciler.go` around lines 128 - 145,
The secret creation has a TOCTOU race: keep the Get/Create flow idempotent by
treating AlreadyExists as success or by using a create-or-update helper;
specifically, in prowjobreconciler.go where you call buildClient.Get(...) and
then buildClient.Create(...) for api.EphemeralClusterTestDoneSignalSecretName,
catch apierrors.IsAlreadyExists(err) from Create and treat it as non-fatal
(log.Debug and continue) or replace the Create with
controllerutil.CreateOrUpdate (or client.Patch/Update) to ensure idempotency;
apply the same change in notifyTestComplete in reconciler.go so both places
accept AlreadyExists as success or use the same create-or-update helper.
There was a problem hiding this comment.
This should never happen in practice because the function gracefullyTerminateClusterProvisioning is executed only when the EphemeralCluster object doesn't exist anymore, therefore there is no way for the reconciler in pkg/controller/ephemeralcluster/reconciler.go to create the test-done-signal.
In notifyTestComplete requests are handled sequentially, therefore there is no way to interleave Get/Create sequence.
Am I missing anything?
There was a problem hiding this comment.
@danilo-gemoli, your analysis is correct — you're not missing anything.
The two conditions together do eliminate the race in practice:
-
Mutual exclusion via
EphemeralClusterexistence —gracefullyTerminateClusterProvisioningis only reached when the EC object is already gone. SincenotifyTestCompleteinreconciler.gois driven by the EC's lifecycle, there is no live EC for it to reconcile, so it cannot concurrently createtest-done-signalfor the same cluster. -
MaxConcurrentReconciles: 1— the ProwJob reconciler is single-threaded, so two concurrent runs ofgracefullyTerminateClusterProvisioningfor the same ProwJob are impossible, ruling out the only remaining interleaver.
My concern was based on a general controller-runtime pattern where these invariants don't hold; in this specific design they do. Happy to drop the suggestion.
1766c0c to
38eecaa
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/controller/ephemeralcluster/prowjobreconciler_test.go (1)
303-318: Make secret assertions deterministic and diagnostics clearer.This block currently lists all secrets and assumes
len == 1, which is brittle and produces misleading errors ("Too many secrets"also when zero). It also doesn’t guard for a missing cluster client and uses"unexpected prowjob"while comparing a secret.Proposed refactor
if tc.wantSecret != nil { - gotSecrets := &corev1.SecretList{} - client := clients[tc.pj.Spec.Cluster] - if err := client.List(context.TODO(), gotSecrets); err != nil { - t.Errorf("get secrets: %s", err) - } - - if len(gotSecrets.Items) == 1 { - gotSecret := gotSecrets.Items[0] - ignoreFields := cmpopts.IgnoreFields(corev1.Secret{}, "ResourceVersion") - if diff := cmp.Diff(tc.wantSecret, &gotSecret, ignoreFields); diff != "" { - t.Errorf("unexpected prowjob: %s", diff) - } - } else { - t.Errorf("Too many secrets") - } + buildClient, ok := clients[tc.pj.Spec.Cluster] + if !ok { + t.Fatalf("missing build client for cluster %q", tc.pj.Spec.Cluster) + } + + gotSecret := &corev1.Secret{} + key := types.NamespacedName{ + Name: tc.wantSecret.Name, + Namespace: tc.wantSecret.Namespace, + } + if err := buildClient.Get(context.TODO(), key, gotSecret); err != nil { + t.Fatalf("get secret %s/%s: %v", key.Namespace, key.Name, err) + } + + ignoreFields := cmpopts.IgnoreFields(corev1.Secret{}, "ResourceVersion") + if diff := cmp.Diff(tc.wantSecret, gotSecret, ignoreFields); diff != "" { + t.Errorf("unexpected secret: %s", diff) + } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/ephemeralcluster/prowjobreconciler_test.go` around lines 303 - 318, The secret assertion is brittle: locate the test block using gotSecrets, corev1.SecretList, clients and tc.pj.Spec.Cluster and make it deterministic by first verifying the cluster client exists (clients[tc.pj.Spec.Cluster]) and returning a clear error if missing, then fetch the specific secret instead of assuming a single list entry—either call client.Get for tc.wantSecret.Name/Namespace or List with a field/label selector matching tc.wantSecret, and handle zero vs multiple matches with distinct messages ("no secret found: <name>" vs "multiple secrets found: <name>"). Finally use cmp.Diff to compare tc.wantSecret to the retrieved secret and update the error text to reference "secret" (not "prowjob") and include the diff in the error message.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/controller/ephemeralcluster/prowjobreconciler_test.go`:
- Around line 303-318: The secret assertion is brittle: locate the test block
using gotSecrets, corev1.SecretList, clients and tc.pj.Spec.Cluster and make it
deterministic by first verifying the cluster client exists
(clients[tc.pj.Spec.Cluster]) and returning a clear error if missing, then fetch
the specific secret instead of assuming a single list entry—either call
client.Get for tc.wantSecret.Name/Namespace or List with a field/label selector
matching tc.wantSecret, and handle zero vs multiple matches with distinct
messages ("no secret found: <name>" vs "multiple secrets found: <name>").
Finally use cmp.Diff to compare tc.wantSecret to the retrieved secret and update
the error text to reference "secret" (not "prowjob") and include the diff in the
error message.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 2b8c4c26-b252-4a85-82c3-3d56c13b32a2
📒 Files selected for processing (4)
pkg/controller/ephemeralcluster/prowjobreconciler.gopkg/controller/ephemeralcluster/prowjobreconciler_test.gopkg/controller/ephemeralcluster/reconciler.gopkg/controller/ephemeralcluster/reconciler_test.go
✅ Files skipped from review due to trivial changes (1)
- pkg/controller/ephemeralcluster/prowjobreconciler.go
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/controller/ephemeralcluster/reconciler.go
- pkg/controller/ephemeralcluster/reconciler_test.go
|
@danilo-gemoli: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Up until now, when the
EphemeralClusterobject is gone, theProwJobrunningci-operatoris simply aborted.This might lead to an unpleasant situation in which
ci-operatoris getting killed in the middle of provisioning an ephemeral-cluster, without giving it the chance to clean up the resources that have been created in the cloud so far.With the new approach, the controller lets
ci-operatorfinish the provisioning procedure and then triggers the deprovision right away, without waiting for the Konflux's test to complete.In this way we let
ci-operatorto gracefully terminate its job.In order to ease the review, I will provide a real use case:
EphemeralClusterobject inapp.cigets delete. This might result as a consequence of the Konflux's pipeling being canceled, as an example.prowjobreconcilerkicks in and get theProwJobobject tied to theEphemeralCluster. We don't know what is the status of thisProwJob, it might be running, pending, etc.prowjobreconcilerlooks for theci-op-xxxNS that should have been created by theci-operatorinstance run by theProwJobtest-done-signalsecret in order to informci-operatorto start the cluster deprovisioning procedures as soon as possible.ProwJob, as it currently does.Summary by CodeRabbit
Bug Fixes
Improvements
Tests