
Conversation

@jsternberg (Collaborator):

This measures the amount of time spent idle during the build. This is
done by collecting the set of task times, determining which sections
contain gaps where no task is running, and aggregating that duration
into a metric.
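For illustration only, the gap calculation described above might look roughly like the sketch below. The interval type, field names, and package name are made up for the example and are not the actual buildkit types.

```go
package metrics // illustrative package name

import (
	"sort"
	"time"
)

// interval is a stand-in for a vertex's started/completed pair.
type interval struct {
	start, end time.Time
}

// idleTime sorts the intervals by start time, then walks them while tracking
// the furthest end time seen so far; any gap between that end and the next
// interval's start is time during which no task was running.
func idleTime(tasks []interval) time.Duration {
	if len(tasks) == 0 {
		return 0
	}
	sort.Slice(tasks, func(i, j int) bool {
		return tasks[i].start.Before(tasks[j].start)
	})

	var idle time.Duration
	end := tasks[0].end
	for _, t := range tasks[1:] {
		if t.start.After(end) {
			idle += t.start.Sub(end)
		}
		if t.end.After(end) {
			end = t.end
		}
	}
	return idle
}
```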

@jsternberg requested review from daghack and tonistiigi on March 7, 2024 20:53
mr.mu.Lock()
defer mr.mu.Unlock()

sort.Slice(mr.Started, func(i, j int) bool {

Member:

Move this logic to a separate function that can be unit-tested to verify it calculates the expected idle time.
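For illustration, a test along those lines might look like the following, assuming the calculation were extracted into a pure function such as the hypothetical idleTime sketched above (not the actual code in this PR):

```go
package metrics

import (
	"testing"
	"time"
)

// Two tasks covering seconds 0-2 and 3-5 leave one second of idle time
// between them.
func TestIdleTime(t *testing.T) {
	base := time.Unix(0, 0)
	at := func(sec int) time.Time { return base.Add(time.Duration(sec) * time.Second) }

	tasks := []interval{
		{start: at(0), end: at(2)},
		{start: at(3), end: at(5)},
	}
	if got, want := idleTime(tasks), time.Second; got != want {
		t.Fatalf("idleTime = %v, want %v", got, want)
	}
}
```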

defer mr.mu.Unlock()

for _, v := range ss.Vertexes {
if v.Started == nil || v.Completed == nil {

Member:

It is possible to get a new update on a vertex that already had a completed time. I don't think it changes the output atm, though.

}

func (mr *idleMetricRecorder) calculateIdleTime(ctx context.Context, o metric.Float64Observer) error {
mr.mu.Lock()

Member:

Why are the locks needed here? Can this be called multiple times? I think that would make the calculation incorrect. E.g.

A 0 - 2
B 1 - 10
C 3 - 5

If the calculation happens at 6, then B has no completed time yet and 2-3 looks like idle time, while it really is not.

Collaborator Author:

The locks are mostly just to ensure there's no race condition. The way OTEL observation functions work is that they're invoked when the Collect method is called. For normal long-running services, this can happen automatically with the periodic reader or through something like Prometheus metrics.

For this one, the only time it realistically gets invoked is at program end, when the manual reader calls Collect, which happens after the build is over.

I figured, though, that it's theoretically possible for it to be invoked at any time, and locks are a low-cost way to ensure this doesn't inadvertently cause a race condition.

For the actual calculation: yes, it could feasibly be called multiple times, but the way we have things set up it won't be. It also doesn't use the current time to calculate idle time. The metric is also a gauge, so the number can go up or down.

It would be possible to get the calculation to be correct for the situation you described, but it probably just wouldn't be worth the extra logic.
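For reference, the general shape of that mechanism is sketched below. This is not buildkit's actual wiring; the meter name, metric name, and computeIdle helper are placeholders.

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/metric/metricdata"
)

// computeIdle stands in for the real idle-time calculation.
func computeIdle() float64 { return 0 }

func main() {
	// A manual reader never collects on its own; observations only happen
	// when Collect is called explicitly.
	reader := sdkmetric.NewManualReader()
	provider := sdkmetric.NewMeterProvider(sdkmetric.WithReader(reader))
	meter := provider.Meter("example")

	gauge, _ := meter.Float64ObservableGauge("build.idle.duration")

	// The callback is not run at registration time; it runs on each Collect.
	_, _ = meter.RegisterCallback(func(ctx context.Context, o metric.Observer) error {
		o.ObserveFloat64(gauge, computeIdle())
		return nil
	}, gauge)

	// With the manual reader, the only collection is the one we trigger
	// ourselves, e.g. once after the build is over.
	var rm metricdata.ResourceMetrics
	_ = reader.Collect(context.Background(), &rm)
}
```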

Member:

Can't we just force one metric send for this per build (and maybe for the others as well that are strictly 1-1)? I think it complicates understanding if we have code with no guarantees about how often and when it gets called, while in reality there is only one specific place where we produce the event. Builds can be very long, so I don't know if there are any actual guarantees that there will never be periodic updates, and if there are, they are hard to follow. In this case it would produce wrong data; in the other cases it might not be obvious what is happening in all possible scenarios.

Collaborator Author:

Metric "sends" and metric gathering are separated from each other by the OTEL library. The exporter is the one responsible for sending and it's the one that determines how often that happens or if it even happens at all.

In OTEL, the resource attributes and the aggregation behavior are mostly what determine how this works. For a counter, which is what the other metrics are, it's a monotonically increasing sum so it's easier for us to just add values to it. But, it would still technically be possible for the reader to retrieve what the current counters are at any time.

For this metric, I chose a gauge to represent the value. The aggregation behavior for gauges is to always take the last value. The value can go up or down.

I still think this is correct, but if we want to make sure it's always "correct", I can also factor in the start times of in-progress vertices to avoid the problem you mentioned in the first comment. That way, we wouldn't consider 2-3 idle time, because there's a vertex that started at 1 and has no stop time at the moment.
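A rough sketch of that adjustment, reusing the illustrative interval and idleTime definitions from the earlier example (again, not the actual implementation):

```go
// idleTimeWithRunning treats a vertex that has started but not completed as
// busy from its start time onward, so gaps it overlaps are not counted as idle.
func idleTimeWithRunning(completed []interval, runningStarts []time.Time) time.Duration {
	tasks := append([]interval(nil), completed...)

	// The calculation only looks at gaps between the intervals it is given,
	// so extending each running vertex to the latest completed end time is
	// enough to cover every gap that could follow its start.
	var maxEnd time.Time
	for _, t := range completed {
		if t.end.After(maxEnd) {
			maxEnd = t.end
		}
	}
	for _, s := range runningStarts {
		end := maxEnd
		if s.After(end) {
			end = s
		}
		tasks = append(tasks, interval{start: s, end: end})
	}

	// With A=0-2 and C=3-5 completed and B started at 1 but unfinished, the
	// 2-3 gap is now covered by B's synthetic interval and is not reported
	// as idle.
	return idleTime(tasks)
}
```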

@jsternberg force-pushed the build-idle-time-metric branch 2 times, most recently from 2868fed to 337a88b on March 15, 2024 20:35
This measures the amount of time spent idle during the build. This is
done by collecting the set of task times, determining which sections
contain gaps where no task is running, and aggregating that duration
into a metric.

Signed-off-by: Jonathan A. Sternberg <[email protected]>
@jsternberg force-pushed the build-idle-time-metric branch from 337a88b to a4a8846 on March 18, 2024 13:43
@tonistiigi merged commit 4af0ed5 into docker:master on March 18, 2024
@jsternberg deleted the build-idle-time-metric branch on March 19, 2024 12:07