feat: add Kubernetes health probe endpoints#2327
feat: add Kubernetes health probe endpoints#2327robinlem wants to merge 2 commits intosustainable-computing-io:mainfrom
Conversation
Add dedicated health check endpoints for improved Kubernetes integration: - /probe/livez: Liveness probe with heartbeat-based monitoring - /probe/readyz: Readiness probe checking data availability Key features: * Lightweight checks (<10µs response time) using atomic operations * Support for both passive (interval=0) and active collection modes * JSON response format with status, timestamp, and duration * Configurable tolerance (2x collection interval) for liveness detection * Thread-safe implementation with comprehensive test coverage Implementation: * Add LiveChecker and ReadyChecker interfaces to service package * Implement health checks in PowerMonitor with heartbeat tracking * Create HealthProbeService for HTTP endpoint handling * Update Helm chart to use new endpoints by default Breaking change: Helm chart now uses /probe/* endpoints instead of /metrics for health probes, providing more accurate health status detection.
Add healthCheckTolerance option to monitor for flexible liveness probe timing. Default remains 2.0x interval for backward compatibility.
| if pm.snapshot.Load() == nil { | ||
| return false, fmt.Errorf("no data yet") | ||
| } |
There was a problem hiding this comment.
This doesn't make it not ready. No snapshot only means that no scrape has been made and does not mean monitor is not in ready state ...
There was a problem hiding this comment.
Monitor is Ready once Run is called.
There was a problem hiding this comment.
I think snapshot-based readiness check is actually better.
Why: Run() can be called successfully, but if firstReading() or calculatePower() fail in refreshSnapshot() (called with run() ), we get:
- running = true (service started)
- snapshot = nil (no data due to collection error)
Don't you think for a monitoring service, "ready" should mean "can provide data", not just "is running".
I think your thoughs make more sense for a startup probe but not a readyness probe.
I have read this documentation : https://kubernetes.io/docs/reference/using-api/health-checks/
It's not crystal clear but they say “The kubelet uses readiness probes to know when a container is ready to start accepting traffic.”
| ) | ||
|
|
||
| // Create health probe service | ||
| healthProbeService := server.NewHealthProbeService(apiServer, pm, pm, logger) |
There was a problem hiding this comment.
why do we need 2 pm 🤔 ? Shouldn't the health-probe have access to all services, filter those that have Liveness and Readyness checks?
Also keep in mind that when a service's Init() is done, and all services's Run (see internal/service.Runner) is blocked, kepler should be in Ready state. (We may have to rethink the readiness probe, there is chance to simplify it).
There was a problem hiding this comment.
The pm (PowerMonitor) is passed twice because it implements both LiveChecker and ReadyChecker interfaces, serving as both the liveness and readiness probe checker. I wanted a clear split for both, but we can change.
|
This PR is stale because it has been open 60 days with no activity. |
Add dedicated health check endpoints for improved Kubernetes integration:
Key features:
Implementation:
Breaking change: Helm chart now uses /probe/* endpoints instead of /metrics for health probes, providing more accurate health status detection.
Closes 2282