feat: add Kubernetes health probe endpoints by robinlem · Pull Request #2327 · sustainable-computing-io/kepler

robinlem · 2025-09-17T23:42:11Z

Add dedicated health check endpoints for improved Kubernetes integration:

- /probe/livez: Liveness probe with heartbeat-based monitoring
- /probe/readyz: Readiness probe checking data availability

Key features:

* Lightweight checks (<10µs response time) using atomic operations
* Support for both passive (interval=0) and active collection modes
* JSON response format with status, timestamp, and duration
* Configurable tolerance (2x collection interval) for liveness detection
* Thread-safe implementation with comprehensive test coverage

Implementation:

* Add LiveChecker and ReadyChecker interfaces to service package
* Implement health checks in PowerMonitor with heartbeat tracking
* Create HealthProbeService for HTTP endpoint handling
* Update Helm chart to use new endpoints by default

Breaking change: Helm chart now uses /probe/* endpoints instead of /metrics for health probes, providing more accurate health status detection.

Closes 2282
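To make the JSON response format in the description concrete, here is a minimal sketch of a probe handler. It is not the PR's HealthProbeService: the `probeResponse` type, its JSON key names, the `checkFunc` type, the `probeHandler` and `probeOnce` helpers, and the choice of 200/503 status codes are all assumptions for illustration. Only the endpoint paths and the three fields (status, timestamp, duration) come from the description.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// probeResponse carries the fields named in the PR description.
// The JSON key names and the extra "error" field are assumptions.
type probeResponse struct {
	Status    string `json:"status"`
	Timestamp string `json:"timestamp"`
	Duration  string `json:"duration"`
	Error     string `json:"error,omitempty"`
}

// checkFunc is a hypothetical stand-in for a liveness or readiness check.
type checkFunc func() (bool, error)

// probeHandler wraps a check in an HTTP handler that answers 200 when the
// check passes and 503 otherwise, always with a JSON body.
func probeHandler(check checkFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		ok, err := check()

		resp := probeResponse{Status: "ok", Timestamp: start.UTC().Format(time.RFC3339)}
		code := http.StatusOK
		if !ok || err != nil {
			resp.Status = "unavailable"
			code = http.StatusServiceUnavailable
			if err != nil {
				resp.Error = err.Error()
			}
		}
		resp.Duration = time.Since(start).String()

		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		_ = json.NewEncoder(w).Encode(resp)
	}
}

// probeOnce calls a handler in-process and returns the HTTP status code
// together with the decoded "status" field.
func probeOnce(h http.Handler, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	var resp probeResponse
	_ = json.Unmarshal(rec.Body.Bytes(), &resp)
	return rec.Code, resp.Status
}

func main() {
	mux := http.NewServeMux()
	mux.Handle("/probe/livez", probeHandler(func() (bool, error) { return true, nil }))
	mux.Handle("/probe/readyz", probeHandler(func() (bool, error) {
		return false, fmt.Errorf("no data yet")
	}))

	for _, p := range []string{"/probe/livez", "/probe/readyz"} {
		code, status := probeOnce(mux, p)
		fmt.Println(p, code, status)
	}
}
```

Returning a non-2xx code on failure matters more than the body: the kubelet's HTTP probes decide on the status code, so the JSON is mainly for humans and debugging.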


Add healthCheckTolerance option to monitor for flexible liveness probe timing. Default remains 2.0x interval for backward compatibility.
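The heartbeat-plus-tolerance idea can be sketched as follows. This is an illustration under stated assumptions, not the PR's PowerMonitor code: the `heartbeat` type, its methods, and the `liveAfter` helper are hypothetical, as is the choice to treat startup as the first beat and to report passive mode (interval=0) as always live. The 2.0x default multiplier is the only value taken from the text above.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// heartbeat tracks the last completed collection cycle.
// Type and method names are hypothetical.
type heartbeat struct {
	last      atomic.Int64  // UnixNano of the most recent beat
	interval  time.Duration // collection interval; 0 means passive mode
	tolerance float64       // multiplier applied to interval
}

// newHeartbeat falls back to the 2.0x default when tolerance is not positive.
func newHeartbeat(interval time.Duration, tolerance float64, start time.Time) *heartbeat {
	if tolerance <= 0 {
		tolerance = 2.0
	}
	h := &heartbeat{interval: interval, tolerance: tolerance}
	h.last.Store(start.UnixNano()) // assumption: startup counts as the first beat
	return h
}

// beat is meant to be called by the collection loop after each refresh.
func (h *heartbeat) beat(now time.Time) { h.last.Store(now.UnixNano()) }

// isLive reports whether the last beat is recent enough. It only performs
// one atomic load and a comparison, so it never blocks the collector.
func (h *heartbeat) isLive(now time.Time) (bool, error) {
	if h.interval == 0 {
		// Passive mode: nothing collects periodically, so nothing can go stale.
		return true, nil
	}
	limit := time.Duration(float64(h.interval) * h.tolerance)
	if age := now.Sub(time.Unix(0, h.last.Load())); age > limit {
		return false, fmt.Errorf("last heartbeat %v ago exceeds limit %v", age, limit)
	}
	return true, nil
}

// liveAfter answers: if the last beat happened `elapsed` ago, is the monitor live?
func liveAfter(interval time.Duration, tolerance float64, elapsed time.Duration) bool {
	start := time.Unix(1_700_000_000, 0)
	h := newHeartbeat(interval, tolerance, start)
	h.beat(start)
	ok, _ := h.isLive(start.Add(elapsed))
	return ok
}

func main() {
	fmt.Println(liveAfter(5*time.Second, 2.0, 9*time.Second))  // within 2 x 5s
	fmt.Println(liveAfter(5*time.Second, 2.0, 11*time.Second)) // beyond 2 x 5s
	fmt.Println(liveAfter(0, 2.0, time.Hour))                  // passive mode
}
```

A larger multiplier trades slower detection of a stuck collector for fewer spurious restarts when a collection cycle occasionally runs long.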

sthaha · 2025-09-18T05:12:00Z

+	if pm.snapshot.Load() == nil {
+		return false, fmt.Errorf("no data yet")
+	}


This doesn't make it not ready. No snapshot only means that no scrape has been made; it does not mean the monitor is not in a ready state ...

Monitor is Ready once Run is called.

robinlem · 2025-09-19

I think a snapshot-based readiness check is actually better.

Why: Run() can be called successfully, but if firstReading() or calculatePower() fails in refreshSnapshot() (called within run()), we get:

- running = true (service started)
- snapshot = nil (no data due to collection error)

Don't you think that, for a monitoring service, "ready" should mean "can provide data", not just "is running"?
I think your thoughts make more sense for a startup probe than for a readiness probe.

I have read this documentation: https://kubernetes.io/docs/reference/using-api/health-checks/
It's not crystal clear, but it says: “The kubelet uses readiness probes to know when a container is ready to start accepting traffic.”
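The two readiness definitions debated above can be put side by side in a small sketch. It assumes, as the reply describes, that the first collection happens inside run; the `monitor` and `powerSnapshot` types, the two `readyWhen...` methods, and the `readinessAfterRun` helper are hypothetical and only mirror the `snapshot.Load() == nil` check from the diff.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// powerSnapshot is a placeholder for collected data.
type powerSnapshot struct{ watts float64 }

// monitor is a hypothetical, stripped-down stand-in for PowerMonitor.
type monitor struct {
	running  atomic.Bool
	snapshot atomic.Pointer[powerSnapshot]
}

// run marks the monitor as started and then attempts a first collection.
// A failed collection leaves snapshot nil while running stays true.
func (m *monitor) run(collect func() (*powerSnapshot, error)) {
	m.running.Store(true)
	if snap, err := collect(); err == nil {
		m.snapshot.Store(snap)
	}
}

// readyWhenRunning is the "ready once Run is called" definition.
func (m *monitor) readyWhenRunning() (bool, error) {
	if !m.running.Load() {
		return false, errors.New("monitor not running")
	}
	return true, nil
}

// readyWhenData is the "ready means it can provide data" definition.
func (m *monitor) readyWhenData() (bool, error) {
	if m.snapshot.Load() == nil {
		return false, errors.New("no data yet")
	}
	return true, nil
}

// readinessAfterRun starts a monitor whose first collection optionally
// fails and reports what each definition says afterwards.
func readinessAfterRun(collectFails bool) (byRunning, byData bool) {
	m := &monitor{}
	m.run(func() (*powerSnapshot, error) {
		if collectFails {
			return nil, errors.New("collection failed")
		}
		return &powerSnapshot{watts: 42}, nil
	})
	byRunning, _ = m.readyWhenRunning()
	byData, _ = m.readyWhenData()
	return byRunning, byData
}

func main() {
	for _, fails := range []bool{false, true} {
		r, d := readinessAfterRun(fails)
		fmt.Printf("collection fails=%v: ready by running=%v, ready by data=%v\n", fails, r, d)
	}
}
```

The two definitions only disagree in the failure case, which is exactly the case under discussion. If snapshots were instead computed lazily on the first scrape, the data-based check would stay false until something scrapes, which is the reviewer's objection.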

sthaha · 2025-09-18T05:27:32Z

 	)

+	// Create health probe service
+	healthProbeService := server.NewHealthProbeService(apiServer, pm, pm, logger)


Why do we need 2 pm 🤔? Shouldn't the health probe have access to all services and filter those that have liveness and readiness checks?

Also keep in mind that when a service's Init() is done and all services' Run (see internal/service.Runner) is blocked, kepler should be in Ready state. (We may have to rethink the readiness probe; there is a chance to simplify it.)

robinlem · 2025-09-19

The pm (PowerMonitor) is passed twice because it implements both the LiveChecker and ReadyChecker interfaces, so it serves as both the liveness and the readiness checker. I wanted a clear split between the two, but we can change that.
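The reviewer's alternative, handing the probe service every registered service and letting it pick out the ones that implement a check, could look roughly like this. The interface names LiveChecker and ReadyChecker come from the PR; the method names `IsLive` and `IsReady`, the `Service` interface, the `checkers` type, and both example services are assumptions made for this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// Service is a hypothetical minimal service interface.
type Service interface{ Name() string }

// LiveChecker and ReadyChecker are named in the PR; the method
// signatures here are assumed.
type LiveChecker interface{ IsLive() (bool, error) }
type ReadyChecker interface{ IsReady() (bool, error) }

// powerMonitor implements both check interfaces.
type powerMonitor struct{ hasData bool }

func (p *powerMonitor) Name() string          { return "power-monitor" }
func (p *powerMonitor) IsLive() (bool, error) { return true, nil }
func (p *powerMonitor) IsReady() (bool, error) {
	if !p.hasData {
		return false, errors.New("no data yet")
	}
	return true, nil
}

// webServer implements neither check interface.
type webServer struct{}

func (w *webServer) Name() string { return "web-server" }

// checkers holds the services that opted in to each kind of probe.
type checkers struct {
	live  []LiveChecker
	ready []ReadyChecker
}

// collectCheckers filters services by type assertion, so a service that
// implements both interfaces is passed once and registered twice.
func collectCheckers(services []Service) checkers {
	var c checkers
	for _, s := range services {
		if lc, ok := s.(LiveChecker); ok {
			c.live = append(c.live, lc)
		}
		if rc, ok := s.(ReadyChecker); ok {
			c.ready = append(c.ready, rc)
		}
	}
	return c
}

// allReady fails on the first readiness check that does not pass.
func (c checkers) allReady() (bool, error) {
	for _, rc := range c.ready {
		if ok, err := rc.IsReady(); !ok || err != nil {
			return false, err
		}
	}
	return true, nil
}

func main() {
	c := collectCheckers([]Service{&powerMonitor{hasData: true}, &webServer{}})
	ok, err := c.allReady()
	fmt.Println(len(c.live), len(c.ready), ok, err)
}
```

With this shape the constructor no longer needs the same value twice, and new services gain probes simply by implementing an interface.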

github-actions · 2026-04-09T00:34:27Z

This PR is stale because it has been open 60 days with no activity.

github-actions bot added the feat label (A new feature or enhancement) on Sep 17, 2025

feat: make health check tolerance configurable

50efc05


sthaha reviewed Sep 18, 2025


robinlem marked this pull request as ready for review September 22, 2025 13:42

sthaha requested a review from vimalk78 September 30, 2025 09:30

github-actions bot added the stale label (Stale state - issue will be closed in 7 days) on Apr 9, 2026

robinlem wants to merge 2 commits into sustainable-computing-io:main from robinlem:add-health-check-endpoints-for-kubernetes-probes
