Fix docker stats parsing with large amount of interrupts #49734

Shaggy84675 · 2025-04-02T19:32:17Z

- What I did
I've fixed a bug where docker stats was unable to collect stats on a machine with a large number of CPU cores and interrupts.

- How I did it
I've replaced the scanner with a reader so that longer lines from /proc/stat can be parsed (the scanner's default buffer is 64 kB). The scanner does have an option to increase its internal buffer size, but we wanted to avoid growing the buffer and wasting memory just for lines that are discarded anyway.

Another issue was that if the scanner returned an error, there was no way to read even the part that fit into the buffer, so it wasn't possible to determine exactly where the error happened.
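A minimal sketch of the limitation described above, using synthetic input rather than a real /proc/stat: a default bufio.Scanner stops with bufio.ErrTooLong once a line exceeds bufio.MaxScanTokenSize (64 KiB). The helper names sampleStat and scanLines and the sample values are made up for illustration and are not part of this PR.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strings"
)

// sampleStat builds a synthetic /proc/stat-like input: two short cpu
// records followed by an "intr" line with intrFields counters.
// All values are invented for illustration.
func sampleStat(intrFields int) string {
	return "cpu  1 2 3 4 5 6 7 0 0 0\n" +
		"cpu0 1 2 3 4 5 6 7 0 0 0\n" +
		"intr" + strings.Repeat(" 0", intrFields) + "\n"
}

// scanLines reads input with a default bufio.Scanner and returns the
// number of lines scanned successfully plus the scanner's error, if any.
func scanLines(input string) (int, error) {
	sc := bufio.NewScanner(strings.NewReader(input))
	n := 0
	for sc.Scan() {
		n++
	}
	return n, sc.Err()
}

func main() {
	// 40000 two-byte fields make the "intr" line roughly 80 kB long,
	// which is above the scanner's default token limit.
	n, err := scanLines(sampleStat(40000))
	fmt.Println("lines scanned:", n)
	fmt.Println("too long:", errors.Is(err, bufio.ErrTooLong))
}
```

Scanner.Buffer could raise the limit, but that would spend memory on a line that is thrown away anyway.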

The ReadLine function returns a flag, isPrefix, that is set to true if the line is too long to fit into the internal buffer. Based on that, we can skip further processing.

- How to verify it
On a machine with a large number of interrupts (where a line in /proc/stat exceeds 64 kB), run the command docker stats.

Fix `docker stats` not working properly on machines with high CPU core count

- A picture of a cute animal (not mandatory but encouraged)

vvoland

Thanks!

I left some comments.

Also, could you please add unit tests that test this function against the example file contents we provided?

#49709 (comment)
#49709 (comment)

daemon/stats_unix.go

vvoland · 2025-04-03T13:13:24Z

daemon/stats_unix.go

+	rdr := bufio.NewReader(f)
+
+	for {
+		data, isPrefix, err := rdr.ReadLine()


isPrefix is a bit of a weird name; looking at the code, it isn't exactly obvious what it actually is.
Perhaps we could rename it to:

Suggested change

-		data, isPrefix, err := rdr.ReadLine()
+		data, tooLong, err := rdr.ReadLine()

What do you think?

That's its name in https://pkg.go.dev/bufio#Reader.ReadLine

But, it'll be set for each partial read - then not-set on the final read of the long line. The data from that final read needs to be discarded too.

It'd be best to discard long lines before converting the bytes to a string, there's no need to check whether a line begins with "cpu" if the line's too long.
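A small sketch of the behaviour described here, assuming a deliberately tiny 16-byte buffer (the minimum bufio uses) so that a 40-byte line has to be returned in pieces. The names fragment, readFragments and sampleLines are hypothetical and only serve the illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// fragment records the size and isPrefix flag of one ReadLine result.
type fragment struct {
	size     int
	isPrefix bool
}

// sampleLines returns a short line followed by a 40-byte line.
func sampleLines() io.Reader {
	return strings.NewReader("cpu 1 2\n" + strings.Repeat("x", 40) + "\n")
}

// readFragments calls ReadLine until EOF and records every result.
func readFragments(r io.Reader, bufSize int) ([]fragment, error) {
	rdr := bufio.NewReaderSize(r, bufSize)
	var out []fragment
	for {
		data, isPrefix, err := rdr.ReadLine()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		out = append(out, fragment{size: len(data), isPrefix: isPrefix})
	}
}

func main() {
	frags, err := readFragments(sampleLines(), 16)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, f := range frags {
		// The last piece of the long line is reported with isPrefix == false.
		fmt.Printf("read %d: %d bytes, isPrefix=%v\n", i, f.size, f.isPrefix)
	}
}
```

The long line comes back as several reads: the leading pieces have isPrefix set, and the final piece does not, so a caller that wants to drop the whole line has to remember that it was in the middle of one.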

Good catch! I've split the condition into two and now skip any further processing if the read is partial. The only trouble is with the last read... I didn't know it worked this way.

Maybe it's all right that it's converted into a string and then discarded by the other condition that checks for the cpu string, since it's only one small line? Otherwise I'd probably have to add some kind of helper variable or nested loop, which would add more complexity.
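For reference, the "helper variable" alternative mentioned here could look roughly like the sketch below. It is not what the PR ended up doing; shortLines, sampleInput and the skipping flag are hypothetical names, and the 16-byte buffer is chosen only to keep the example small.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// sampleInput returns two short lines with a 40-byte line between them.
func sampleInput() io.Reader {
	return strings.NewReader("cpu 1 2\n" + strings.Repeat("x", 40) + "\ncpu0 3 4\n")
}

// shortLines returns only the lines that fit into the reader's buffer.
// Over-long lines are dropped completely, including their final piece.
func shortLines(r io.Reader, bufSize int) ([]string, error) {
	rdr := bufio.NewReaderSize(r, bufSize)
	var lines []string
	skipping := false // true while in the middle of an over-long line
	for {
		data, isPrefix, err := rdr.ReadLine()
		if err == io.EOF {
			return lines, nil
		}
		if err != nil {
			return lines, err
		}
		if isPrefix {
			// A leading piece of an over-long line: drop it and remember.
			skipping = true
			continue
		}
		if skipping {
			// The final piece of the over-long line: drop it as well.
			skipping = false
			continue
		}
		lines = append(lines, string(data))
	}
}

func main() {
	lines, err := shortLines(sampleInput(), 16)
	fmt.Println(lines, err)
}
```

This keeps reading after the long line, which is only needed if records of interest can follow it; when processing stops at the first over-long line, the extra state is unnecessary.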

Oh, I'm sorry - I completely misread it, and missed the assumption that cpu lines are first in the file. Thank you for moving the string(data), I think that's slightly better - but not the issue I first thought! So, I think it can just be:

	// Assume all cpu* records are at the start of the file, like glibc:
	// https://github.com/bminor/glibc/blob/5d00c201b9a2da768a79ea8d5311f257871c0b43/sysdeps/unix/sysv/linux/getsysstats.c#L108-L135
	if isPartial || len(data) < 4 {
		break
	}
	line := string(data)
	if line[:3] != "cpu" {
		break
	}

(Please do squash the commits though.)

Thank you, fixed!

I think I've messed up squashing the commits.

daemon/stats_unix.go

Shaggy84675 · 2025-04-03T16:51:16Z

Yes, I'll try my best to create the unit tests... I've never written any, so it'll take me a bit to figure out how to do them properly... :)
Also, I suppose there aren't any existing tests for this yet that would need to be modified, right?

Shaggy84675 · 2025-04-05T18:09:17Z

I've added the unit test. Please let me know if any improvements are needed.

I was also wondering whether we should add more unit tests for cases where the /proc/stat structure is different. But then I thought that may never happen, I guess?

robmry

LGTM - thank you!

robmry · 2025-04-07T08:15:56Z

I was also thinking if we should add more unit tests if the /proc/stat structure is different. But then I thought that may never happens i guess?

It wouldn't do any harm to test the error paths - but as it is, the change is already an improvement.

robmry · 2025-04-07T08:31:46Z

(Rebased to sort out the validate-vendor test.)

robmry · 2025-04-07T09:19:00Z

Oh, for the Windows unit tests, daemon\stats_unix_test.go needs //go:build !windows at the top of the file ...

daemon\stats_unix_test.go:16:18: undefined: procStatPath
daemon\stats_unix_test.go:17:2: undefined: procStatPath
daemon\stats_unix_test.go:18:17: undefined: procStatPath

daemon/stats_unix_test.go

vvoland

And a couple of non-blocking suggestions 😄

Not really blockers though, so can be left for a follow up.

vvoland · 2025-04-07T09:24:03Z

daemon/stats_unix.go

 // information.
 func getSystemCPUUsage() (cpuUsage uint64, cpuNum uint32, _ error) {
-	f, err := os.Open("/proc/stat")
+	f, err := os.Open(procStatPath)


We could split this into two functions:

	func getSystemCPUUsage() (cpuUsage uint64, cpuNum uint32, _ error) {
		f, err := os.Open("/proc/stat")
		if err != nil {
			return 0, 0, err
		}
		defer f.Close()
		return readSystemCPUUsage(f)
	}

	func readSystemCPUUsage(r io.Reader) (cpuUsage uint64, cpuNum uint32, _ error) {
		rdr := bufio.NewReaderSize(r, 1024)
		...
	}
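To make the split concrete, here is a self-contained sketch of a reader-based parser in the spirit of this suggestion. It is an illustration under assumptions, not the merged code: parseCPUStat and sampleInput are hypothetical names, clockTicksPerSecond is assumed to be 100, EOF is treated as a normal end of input, and the sample counters are invented.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strconv"
	"strings"
)

const (
	nanoSecondsPerSecond = 1_000_000_000
	// Assumed for this sketch; real code would obtain it from the system.
	clockTicksPerSecond = 100
)

// sampleInput returns invented stat data ending in an over-long "intr" line.
func sampleInput() io.Reader {
	return strings.NewReader("cpu  10 20 30 40 50 60 70 0 0 0\n" +
		"cpu0 5 10 15 20 25 30 35 0 0 0\n" +
		"cpu1 5 10 15 20 25 30 35 0 0 0\n" +
		"intr" + strings.Repeat(" 0", 40000) + "\n")
}

// parseCPUStat sums the aggregate cpu counters and counts per-CPU records.
func parseCPUStat(r io.Reader) (cpuUsage uint64, cpuNum uint32, _ error) {
	rdr := bufio.NewReaderSize(r, 1024)
	for {
		data, isPartial, err := rdr.ReadLine()
		if err == io.EOF {
			break
		}
		if err != nil {
			return 0, 0, fmt.Errorf("error reading stat data: %w", err)
		}
		// Assume all cpu* records are at the start of the input, so the
		// first over-long, too-short, or non-cpu line ends the loop.
		if isPartial || len(data) < 4 {
			break
		}
		line := string(data)
		if line[:3] != "cpu" {
			break
		}
		if line[3] == ' ' {
			// Aggregate "cpu" line: sum the first seven counters.
			parts := strings.Fields(line)
			if len(parts) < 8 {
				return 0, 0, fmt.Errorf("invalid number of cpu fields")
			}
			var ticks uint64
			for _, s := range parts[1:8] {
				v, err := strconv.ParseUint(s, 10, 64)
				if err != nil {
					return 0, 0, fmt.Errorf("unable to convert %q: %w", s, err)
				}
				ticks += v
			}
			cpuUsage = ticks * nanoSecondsPerSecond / clockTicksPerSecond
		} else if '0' <= line[3] && line[3] <= '9' {
			// Per-CPU line such as "cpu0".
			cpuNum++
		}
	}
	return cpuUsage, cpuNum, nil
}

func main() {
	usage, num, err := parseCPUStat(sampleInput())
	fmt.Println(usage, num, err)
}
```

Because the parser only needs an io.Reader, a test can feed it a strings.Reader with inline data instead of swapping a package-level file path.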

vvoland · 2025-04-07T09:25:50Z

daemon/stats_unix_test.go

+	dummyFilePath := filepath.Join("testdata", "stat")
+	expectedCpuUsage := uint64(65647090000000)
+	expectedCpuNum := uint32(128)
+
+	origStatPath := procStatPath
+	procStatPath = dummyFilePath
+	defer func() { procStatPath = origStatPath }()


This way we can avoid replacing the file path and inline the testdata in the test:

Suggested change

-	dummyFilePath := filepath.Join("testdata", "stat")
-	expectedCpuUsage := uint64(65647090000000)
-	expectedCpuNum := uint32(128)
-
-	origStatPath := procStatPath
-	procStatPath = dummyFilePath
-	defer func() { procStatPath = origStatPath }()
+	input := strings.NewReader(`
+<stat content>
+`)

vvoland · 2025-04-07T09:26:28Z

daemon/stats_unix_test.go

+	_, err := os.Stat(dummyFilePath)
+	assert.NilError(t, err)
+
+	cpuUsage, cpuNum, err := getSystemCPUUsage()


And then test the function that allows passing the file content directly:

Suggested change

-	cpuUsage, cpuNum, err := getSystemCPUUsage()
+	cpuUsage, cpuNum, err := readSystemCPUUsage(input)

vvoland · 2025-04-07T09:27:42Z

daemon/stats_unix_test.go

+	assert.Equal(t, cpuUsage, expectedCpuUsage)
+	assert.Equal(t, cpuNum, expectedCpuNum)


We can inline expectedCpuUsage and expectedCpuNum and use cmp assertions:

Suggested change

-	assert.Equal(t, cpuUsage, expectedCpuUsage)
-	assert.Equal(t, cpuNum, expectedCpuNum)
+	assert.Check(t, is.Equal(cpuUsage, uint64(65647090000000)))
+	assert.Check(t, is.Equal(cpuNum, uint32(128)))

This fix addresses issues where the scanner was unable to properly parse longer outputs from /proc/stat. This could happen on an ARM machine with a large number of CPU cores (and interrupts). By switching to a reader we have more control over data parsing and can discard unnecessary data.

Signed-off-by: Patrik Leifert <[email protected]>

Shaggy84675 · 2025-04-07T14:21:57Z

@robmry @vvoland fixed. I think I've messed up your previous rebase, so you might need to do it again... I'm still fighting with it a bit sometimes 🙈

robmry · 2025-04-07T16:03:20Z

We can ignore the failed unit tests - TestEndpointStore is fixed by #49764

vvoland

Thanks!

vvoland reviewed Apr 3, 2025

vvoland added the status/2-code-review, kind/bugfix, and area/runtime labels Apr 3, 2025

vvoland added this to the 28.0.5 milestone Apr 3, 2025

vvoland added this to Issue Triage Apr 3, 2025

github-project-automation bot moved this to New in Issue Triage Apr 3, 2025

vvoland moved this from New to Needs author feedback in Issue Triage Apr 3, 2025

Shaggy84675 force-pushed the 49709-fix_system_cpu_usage_stat branch 2 times, most recently from 84379a3 to c294f39 Compare April 5, 2025 18:05

robmry approved these changes Apr 7, 2025

robmry force-pushed the 49709-fix_system_cpu_usage_stat branch from c294f39 to 302bf7c Compare April 7, 2025 08:27

vvoland requested changes Apr 7, 2025

vvoland reviewed Apr 7, 2025

Shaggy84675 force-pushed the 49709-fix_system_cpu_usage_stat branch from fa20c22 to 5c0c8ce Compare April 7, 2025 14:17

Shaggy84675 force-pushed the 49709-fix_system_cpu_usage_stat branch from 5c0c8ce to e22d04e Compare April 7, 2025 14:19

thompson-shaun modified the milestones: 28.0.5, 28.1.0 Apr 7, 2025

vvoland moved this from Needs author feedback to Needs maintainer feedback in Issue Triage Apr 7, 2025

robmry approved these changes Apr 7, 2025

vvoland approved these changes Apr 7, 2025

vvoland merged commit 8327848 into moby:master Apr 7, 2025
146 of 148 checks passed

vvoland added the impact/changelog label Apr 7, 2025

vvoland mentioned this pull request Apr 7, 2025

docker stats follow-up #49768

Closed

vvoland moved this from Needs maintainer feedback to Accepted in Issue Triage Apr 7, 2025
