docs(proposal): Simplify Concurrency Model #2435

vimalk78 wants to merge 1 commit into sustainable-computing-io:main
Conversation
| `99a271ec` | #2146 | Added context check after `select` in `scheduleNextCollection` | Recursive goroutine scheduling — `select` on timer + context can pick timer after cancellation |
| `c5a2be5e` | — | Replaced `RWMutex` with `atomic.Pointer` for snapshot access | Lock contention between collection goroutine and scrape handler reading snapshots |
| `7c0ee313` | #2386 | Added synchronization to test cleanup | Recursive collection goroutine outlived test — no lifecycle tracking |
| `a64511b4` | — | Moved mock cleanup to safe point | Background collection goroutine read mock expectations during `t.Cleanup` |
+1 I could not find a64511b4 👀
```
→ refreshSnapshot()  ← blocks here for full collection
```

The scrape handler blocks for the entire collection duration.
Worth mentioning the Prometheus scrape timeout here?
The default scrape timeout is 10s. The next paragraph mentions that:

> On a production node with thousands of processes, collection can take hundreds of milliseconds

That is an order-of-magnitude margin, so the timeout is not a risk.
```go
func (pm *PowerMonitor) Snapshot() (*Snapshot, error) {
	if !pm.isFresh() {
```
Doesn't this look like a race condition?
The outer isFresh() reads the snapshot timestamp via the atomic pointer, which should be safe without the mutex - @vimalk78 could you confirm this please?
<!-- SPDX-FileCopyrightText: 2025 The Kepler Authors -->
<!-- SPDX-License-Identifier: Apache-2.0 -->

# EP-004: Simplify Kepler's Concurrency Model
Could we increment this proposal to EP-005, please, to separate it from the pre-existing EP-004? We would need to update all references, including the PR title, images, etc. This would:
- preserve the history and unique reference to EPs
- help us create EP-006 for the upcoming model training EP
Following up on this: please ignore my previous message! Whichever one gets merged first will take EP-004, and the next one will need to increment its version.
In the KEP process, unmerged EP PRs do not reserve a number so this can stay as is for now 👍
Document Kepler's concurrency design, the synchronization primitives it requires, related bug history, and propose a simplified alternative using a ticker loop with mutex-guarded freshness. Includes profiling data showing scrape-dominated collection pattern. Signed-off-by: Vimal Kumar <vimal78@gmail.com>
nikimanoledaki left a comment
LGTM! Left some comments 👍
> The concurrency model serves a deliberate purpose: **data freshness at
> scrape time** (Architecture Principle #4). Prometheus scrapes at its own
> schedule (e.g., every 15-30s), which may not align with Kepler's
```diff
- schedule (e.g., every 15-30s), which may not align with Kepler's
+ schedule (e.g. the [default](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#duration) `scrape_interval` is `1m`), which may not align with Kepler's
```
Replace the recursive goroutine scheduler with a simple ticker loop
and make all collection sequential:

```go
func (pm *PowerMonitor) Run(ctx context.Context) error {
	// Initial collection
	pm.collect()

	ticker := time.NewTicker(pm.interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return nil
		case <-ticker.C:
			pm.collect()
		}
	}
}
```
We use the ticker loop pattern in another project that I maintain, cloudcost-exporter. For example, take a look here: https://github.com/grafana/cloudcost-exporter/blob/main/pkg/google/vpc/vpc.go#L93

```go
go func() {
	ticker := time.NewTicker(PriceRefreshInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			logger.Info("VPC pricing refresh cancelled")
			return
		case <-ticker.C:
			logger.Info("Refreshing VPC pricing data")
			if err := pricingMap.Refresh(ctx); err != nil {
				logger.Error("Failed to refresh VPC pricing data", "error", err)
			}
		}
	}
}()
```

+1 for this approach!