docs(proposal): Simplify Concurrency Model #2435

vimalk78 wants to merge 1 commit into sustainable-computing-io:main
Conversation
| `99a271ec` | #2146 | Added context check after `select` in `scheduleNextCollection` | Recursive goroutine scheduling — `select` on timer + context can pick timer after cancellation |
| `c5a2be5e` | — | Replaced `RWMutex` with `atomic.Pointer` for snapshot access | Lock contention between collection goroutine and scrape handler reading snapshots |
| `7c0ee313` | #2386 | Added synchronization to test cleanup | Recursive collection goroutine outlived test — no lifecycle tracking |
| `a64511b4` | — | Moved mock cleanup to safe point | Background collection goroutine read mock expectations during `t.Cleanup` |
+1 I could not find a64511b4 👀
```
→ refreshSnapshot()  ← blocks here for full collection
```

The scrape handler blocks for the entire collection duration.
Worth mentioning the Prometheus scrape timeout here?
The default scrape timeout is 10s. The next paragraph mentions that:

> On a production node with thousands of processes, collection can take hundreds of milliseconds

That is an order-of-magnitude margin, so the timeout is not a risk.
```go
func (pm *PowerMonitor) Snapshot() (*Snapshot, error) {
	if !pm.isFresh() {
```
Doesn't this look like a race condition?
The outer isFresh() reads the snapshot timestamp via the atomic pointer, which should be safe without the mutex - @vimalk78 could you confirm this please?
<!-- SPDX-FileCopyrightText: 2025 The Kepler Authors -->
<!-- SPDX-License-Identifier: Apache-2.0 -->

# EP-004: Simplify Kepler's Concurrency Model
Could we increment this proposal to EP-005, please, to separate it from the pre-existing EP-004? We would need to update all references, including the PR title, images, etc. This would:
- preserve the history and unique reference to EPs
- help us create EP-006 for the upcoming model training EP
Following up on this: please ignore my previous message! Whichever one gets merged first will take EP-004, and the next one will need to increment its version.
In the KEP process, unmerged EP PRs do not reserve a number so this can stay as is for now 👍
Document Kepler's concurrency design, the synchronization primitives it requires, related bug history, and propose a simplified alternative using a ticker loop with mutex-guarded freshness. Includes profiling data showing scrape-dominated collection pattern. Signed-off-by: Vimal Kumar <vimal78@gmail.com>
nikimanoledaki left a comment
LGTM! Left some comments 👍
> The concurrency model serves a deliberate purpose: **data freshness at
> scrape time** (Architecture Principle #4). Prometheus scrapes at its own
> schedule (e.g., every 15-30s), which may not align with Kepler's
```diff
- schedule (e.g., every 15-30s), which may not align with Kepler's
+ schedule (e.g. the [default](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#duration) `scrape_interval` is `1m`), which may not align with Kepler's
```
Replace the recursive goroutine scheduler with a simple ticker loop
and make all collection sequential:

```go
func (pm *PowerMonitor) Run(ctx context.Context) error {
	// Initial collection
	pm.collect()

	ticker := time.NewTicker(pm.interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return nil
		case <-ticker.C:
			pm.collect()
		}
	}
}
```
We use the ticker loop pattern in another project that I maintain, cloudcost-exporter. For example, take a look here: https://github.com/grafana/cloudcost-exporter/blob/main/pkg/google/vpc/vpc.go#L93

```go
go func() {
	ticker := time.NewTicker(PriceRefreshInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			logger.Info("VPC pricing refresh cancelled")
			return
		case <-ticker.C:
			logger.Info("Refreshing VPC pricing data")
			if err := pricingMap.Refresh(ctx); err != nil {
				logger.Error("Failed to refresh VPC pricing data", "error", err)
			}
		}
	}
}()
```

+1 for this approach!