
Conversation

@jsternberg (Collaborator):

This measures the amount of time spent idle during the build. This is
done by collecting the set of task times, determining which sections
contain gaps where no task is running, and aggregating that duration
into a metric.
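For illustration only, the gap calculation described above might look roughly like the sketch below. The interval type, field names, and package name are made up for the example and are not the actual buildkit types.

```go
package metrics // illustrative package name

import (
	"sort"
	"time"
)

// interval is a stand-in for a vertex's started/completed pair.
type interval struct {
	start, end time.Time
}

// idleTime sorts the intervals by start time, then walks them while tracking
// the furthest end time seen so far; any gap between that end and the next
// interval's start is time during which no task was running.
func idleTime(tasks []interval) time.Duration {
	if len(tasks) == 0 {
		return 0
	}
	sort.Slice(tasks, func(i, j int) bool {
		return tasks[i].start.Before(tasks[j].start)
	})

	var idle time.Duration
	end := tasks[0].end
	for _, t := range tasks[1:] {
		if t.start.After(end) {
			idle += t.start.Sub(end)
		}
		if t.end.After(end) {
			end = t.end
		}
	}
	return idle
}
```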

@jsternberg requested review from daghack and tonistiigi on March 7, 2024 20:53
mr.mu.Lock()
defer mr.mu.Unlock()

sort.Slice(mr.Started, func(i, j int) bool {

Member:

Move this logic to a separate function that can be unit-tested to verify it calculates the expected idle time.
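For illustration, a test along those lines might look like the following, assuming the calculation were extracted into a pure function such as the hypothetical idleTime sketched above (not the actual code in this PR):

```go
package metrics

import (
	"testing"
	"time"
)

// Two tasks covering seconds 0-2 and 3-5 leave one second of idle time
// between them.
func TestIdleTime(t *testing.T) {
	base := time.Unix(0, 0)
	at := func(sec int) time.Time { return base.Add(time.Duration(sec) * time.Second) }

	tasks := []interval{
		{start: at(0), end: at(2)},
		{start: at(3), end: at(5)},
	}
	if got, want := idleTime(tasks), time.Second; got != want {
		t.Fatalf("idleTime = %v, want %v", got, want)
	}
}
```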

defer mr.mu.Unlock()

for _, v := range ss.Vertexes {
if v.Started == nil || v.Completed == nil {

Member:

It is possible to get a new update on a vertex that already had a completed time. I don't think it changes the output atm, though.

}

func (mr *idleMetricRecorder) calculateIdleTime(ctx context.Context, o metric.Float64Observer) error {
mr.mu.Lock()

Member:

Why are the locks needed here? Can this be called multiple times? I think that would make the calculation incorrect. E.g.

A 0 - 2
B 1 - 10
C 3 - 5

If the calculation happens at 6, then B has no completed time yet and 2-3 looks like idle time, while it really is not.

Collaborator Author:

The locks are mostly just to ensure there's no race condition. The way OTEL observation functions work is that they're invoked when the Collect method is called. For normal long-running services, this can happen automatically with the periodic reader or through something like Prometheus metrics.

For this one, the only time it realistically gets invoked is at program end, when the manual reader calls Collect, which happens after the build is over.

I figured, though, that it's theoretically possible for it to be invoked at any time, and locks are a low-cost way to ensure this doesn't inadvertently cause a race condition.

For the actual calculation: yes, it could feasibly be called multiple times, but the way we have things set up it won't be. It also doesn't use the current time to calculate idle time. The metric is also a gauge, so the number can go up or down.

It would be possible to get the calculation to be correct for the situation you described, but it probably just wouldn't be worth the extra logic.
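For reference, the general shape of that mechanism is sketched below. This is not buildkit's actual wiring; the meter name, metric name, and computeIdle helper are placeholders.

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/metric/metricdata"
)

// computeIdle stands in for the real idle-time calculation.
func computeIdle() float64 { return 0 }

func main() {
	// A manual reader never collects on its own; observations only happen
	// when Collect is called explicitly.
	reader := sdkmetric.NewManualReader()
	provider := sdkmetric.NewMeterProvider(sdkmetric.WithReader(reader))
	meter := provider.Meter("example")

	gauge, _ := meter.Float64ObservableGauge("build.idle.duration")

	// The callback is not run at registration time; it runs on each Collect.
	_, _ = meter.RegisterCallback(func(ctx context.Context, o metric.Observer) error {
		o.ObserveFloat64(gauge, computeIdle())
		return nil
	}, gauge)

	// With the manual reader, the only collection is the one we trigger
	// ourselves, e.g. once after the build is over.
	var rm metricdata.ResourceMetrics
	_ = reader.Collect(context.Background(), &rm)
}
```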

Member:

Can't we just force one metric send for this per build (and maybe for the others as well that are strictly 1-1)? I think it complicates understanding if we have code with no guarantees about how often and when it gets called, while in reality there is only one specific place where we produce the event. Builds can be very long, so I don't know if there are any actual guarantees that there will never be periodic updates, and if there are, they are hard to follow. In this case it would produce wrong data; in the other cases it might not be obvious what is happening in all possible scenarios.

Collaborator Author:

Metric "sends" and metric gathering are separated from each other by the OTEL library. The exporter is the one responsible for sending and it's the one that determines how often that happens or if it even happens at all.

In OTEL, the resource attributes and the aggregation behavior are mostly what determine how this works. For a counter, which is what the other metrics are, it's a monotonically increasing sum so it's easier for us to just add values to it. But, it would still technically be possible for the reader to retrieve what the current counters are at any time.

For this metric, I chose a gauge to represent the value. The aggregation behavior for gauges is to always take the last value. The value can go up or down.

I still think this is correct, but if we want to make sure it's always "correct", I can also factor in the start times of in-progress vertices to avoid the problem you mentioned in the first comment. That way, we wouldn't consider 2-3 idle time, because there's a vertex that started at 1 and has no stop time at the moment.
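A rough sketch of that adjustment, reusing the illustrative interval and idleTime definitions from the earlier example (again, not the actual implementation):

```go
// idleTimeWithRunning treats a vertex that has started but not completed as
// busy from its start time onward, so gaps it overlaps are not counted as idle.
func idleTimeWithRunning(completed []interval, runningStarts []time.Time) time.Duration {
	tasks := append([]interval(nil), completed...)

	// The calculation only looks at gaps between the intervals it is given,
	// so extending each running vertex to the latest completed end time is
	// enough to cover every gap that could follow its start.
	var maxEnd time.Time
	for _, t := range completed {
		if t.end.After(maxEnd) {
			maxEnd = t.end
		}
	}
	for _, s := range runningStarts {
		end := maxEnd
		if s.After(end) {
			end = s
		}
		tasks = append(tasks, interval{start: s, end: end})
	}

	// With A=0-2 and C=3-5 completed and B started at 1 but unfinished, the
	// 2-3 gap is now covered by B's synthetic interval and is not reported
	// as idle.
	return idleTime(tasks)
}
```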

@jsternberg force-pushed the build-idle-time-metric branch 2 times, most recently from 2868fed to 337a88b on March 15, 2024 20:35
This measures the amount of time spent idle during the build. This is
done by collecting the set of task times, determining which sections
contain gaps where no task is running, and aggregating that duration
into a metric.

Signed-off-by: Jonathan A. Sternberg <[email protected]>
@jsternberg force-pushed the build-idle-time-metric branch from 337a88b to a4a8846 on March 18, 2024 13:43
@tonistiigi merged commit 4af0ed5 into docker:master on March 18, 2024
@jsternberg deleted the build-idle-time-metric branch on March 19, 2024 12:07