
docs(proposal): Simplify Concurrency Model#2435

Open
vimalk78 wants to merge 1 commit into sustainable-computing-io:main from vimalk78:simplify-concurrency

Conversation


@vimalk78 commented Mar 5, 2026

Document Kepler's concurrency design, the synchronization primitives
it requires, related bug history, and propose a simplified alternative
using a ticker loop with mutex-guarded freshness. Includes profiling
data showing scrape-dominated collection pattern.

@github-actions bot added the docs Documentation changes label Mar 5, 2026
@vimalk78 force-pushed the simplify-concurrency branch from 08e8b1e to 465c787 on March 5, 2026 06:13
| Commit | PR | Fix | Root Cause |
|---|---|---|---|
| `99a271ec` | #2146 | Added context check after `select` in `scheduleNextCollection` | Recursive goroutine scheduling — `select` on timer + context can pick timer after cancellation |
| `c5a2be5e` | — | Replaced `RWMutex` with `atomic.Pointer` for snapshot access | Lock contention between collection goroutine and scrape handler reading snapshots |
| `7c0ee313` | #2386 | Added synchronization to test cleanup | Recursive collection goroutine outlived test — no lifecycle tracking |
| `a64511b4` | — | Moved mock cleanup to safe point | Background collection goroutine read mock expectations during `t.Cleanup` |
Do these exist?


+1 I could not find a64511b4 👀

```
→ refreshSnapshot() ← blocks here for full collection
```

The scrape handler blocks for the entire collection duration.

Worth mentioning the Prometheus scrape timeout?


The default scrape timeout is 10s. The next paragraph mentions that:

> On a production node with thousands of processes, collection can take hundreds of milliseconds

This is an order-of-magnitude margin, so the timeout is not a risk.


```go
func (pm *PowerMonitor) Snapshot() (*Snapshot, error) {
	if !pm.isFresh() {
```

Doesn't this look like a race condition?


The outer isFresh() reads the snapshot timestamp via the atomic pointer, which should be safe without the mutex - @vimalk78 could you confirm this please?

<!-- SPDX-FileCopyrightText: 2025 The Kepler Authors -->
<!-- SPDX-License-Identifier: Apache-2.0 -->

# EP-004: Simplify Kepler's Concurrency Model

@nikimanoledaki commented Apr 8, 2026


Could we increment this proposal to EP-005 please to separate it from the pre-existing EP-004? We would need to update all references including the PR title, images, etc. This would:

  • preserve the history and unique reference to EPs
  • help us create EP-006 for the upcoming model training EP


Following up with this - please ignore my previous message! Whichever one gets merged first will take EP-004 and the next one will need to increment its version.

In the KEP process, unmerged EP PRs do not reserve a number so this can stay as is for now 👍

  Document Kepler's concurrency design, the synchronization primitives
  it requires, related bug history, and propose a simplified alternative
  using a ticker loop with mutex-guarded freshness. Includes profiling
  data showing scrape-dominated collection pattern.

Signed-off-by: Vimal Kumar <vimal78@gmail.com>
@vimalk78 force-pushed the simplify-concurrency branch from 465c787 to 6b7b200 on April 8, 2026 10:45
@vimalk78 changed the title from "docs(proposal): add EP-004 Simplify Concurrency Model" to "docs(proposal): Simplify Concurrency Model" Apr 8, 2026

@nikimanoledaki left a comment


LGTM! Left some comments 👍


The concurrency model serves a deliberate purpose: **data freshness at
scrape time** (Architecture Principle #4). Prometheus scrapes at its own
schedule (e.g., every 15-30s), which may not align with Kepler's

Suggested change:

```diff
-schedule (e.g., every 15-30s), which may not align with Kepler's
+schedule (e.g. the [default](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#duration) `scrape_interval` is `1m`), which may not align with Kepler's
```


Comment on lines +329 to +349
Replace the recursive goroutine scheduler with a simple ticker loop
and make all collection sequential:

```go
func (pm *PowerMonitor) Run(ctx context.Context) error {
	// Initial collection
	pm.collect()

	ticker := time.NewTicker(pm.interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return nil
		case <-ticker.C:
			pm.collect()
		}
	}
}
```

We use the ticker loop pattern in another project that I maintain, cloudcost-exporter. For example, take a look here: https://github.com/grafana/cloudcost-exporter/blob/main/pkg/google/vpc/vpc.go#L93

```go
go func() {
	ticker := time.NewTicker(PriceRefreshInterval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			logger.Info("VPC pricing refresh cancelled")
			return
		case <-ticker.C:
			logger.Info("Refreshing VPC pricing data")
			if err := pricingMap.Refresh(ctx); err != nil {
				logger.Error("Failed to refresh VPC pricing data", "error", err)
			}
		}
	}
}()
```

+1 for this approach!




Labels

docs Documentation changes


3 participants