Skip to content

Conversation

@Shaggy84675
Copy link
Contributor

@Shaggy84675 Shaggy84675 commented Apr 2, 2025

Fixes #49709

- What I did
I've fixed a bug where docker stats was unable to collect stats on a machine with large amount of CPU cores and interrupts.

- How I did
I've changed scanner for reader to allow larger lines parsing (default 64kB scanner buffer) from /proc/stat. Scanner has of course option to increase the internal buffer size, but we wanted to avoid increasing the buffer and wasting memory just for a lines that are discarded anyway.

Another issue was if scanner threw an error, there's no way how to read at least the part that fits into the buffer, so it wasn't possisble to determine where the error exactly happened.

ReadLine function has a flag isPrefix that is set to true if the line is longer and doesn't fit into internal buffer. Based on that we can skip further processing.

- How to verify it
On a machine with large amount of interrupts (where the line exceeds 64kB in /proc/stat) run command docker stats.

Fix `docker stats` not working properly on machines with high CPU core count

- A picture of a cute animal (not mandatory but encouraged)

20200127_170347(2)

Copy link
Contributor

@vvoland vvoland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

I left some comments.

Also, could you please add unit tests that test this function against the example file contents we provided?

#49709 (comment)
#49709 (comment)

rdr := bufio.NewReader(f)

for {
data, isPrefix, err := rdr.ReadLine()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isPrefix is a bit weird name - looking at the code it isn't exactly obvious what it actually is.
Perhaps we could rename it to:

Suggested change
data, isPrefix, err := rdr.ReadLine()
data, tooLong, err := rdr.ReadLine()

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's its name in https://pkg.go.dev/bufio#Reader.ReadLine

But, it'll be set for each partial read - then not-set on the final read of the long line. The data from that final read needs to be discarded too.

It'd be best to discard long lines before converting the bytes to a string, there's no need to check whether a line begins with "cpu" if the line's too long.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's its name in https://pkg.go.dev/bufio#Reader.ReadLine

But, it'll be set for each partial read - then not-set on the final read of the long line. The data from that final read needs to be discarded too.

It'd be best to discard long lines before converting the bytes to a string, there's no need to check whether a line begins with "cpu" if the line's too long.

Good catch! I've separated the condition into two and skipping any further processing if partial read is true. The only trouble happens a bit with the last read... Didn't know it works this way.

Maybe it is alright it's converted into string and then discarded in the other condition that checks for cpu string? Since it's only one line with small size. Otherwise I'd have to probably add some kind of helper variable or nested loop that'd add more complexity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I'm sorry - I completely misread it, and missed the assumption that cpu lines are first in the file. Thank you for moving the string(data), I think that's slightly better - but not the issue I first thought! So, I think it can just be:

// Assume all cpu* records are at the start of the file, like glibc:
// https://github.com/bminor/glibc/blob/5d00c201b9a2da768a79ea8d5311f257871c0b43/sysdeps/unix/sysv/linux/getsysstats.c#L108-L135
if isPartial || len(data) < 4 {
	break
}
line := string(data)
if line[:3] != "cpu" {
	break
}

(Please do squash the commits though.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, fixed!

I think I've messed up squashing the commits.

@vvoland vvoland added this to the 28.0.5 milestone Apr 3, 2025
@vvoland vvoland moved this from New to Needs author feedback in Issue Triage Apr 3, 2025
@Shaggy84675
Copy link
Contributor Author

Thanks!

I left some comments.

Also, could you please add unit tests that test this function against the example file contents we provided?

#49709 (comment) #49709 (comment)

Yes, I'll try my best to create the unit tests... Never did it so it takes me a bit to figure out how to do them properly... :)
Also I suppose there aren't any existing tests for this yet, that'd need to be modified right?

@Shaggy84675 Shaggy84675 force-pushed the 49709-fix_system_cpu_usage_stat branch 2 times, most recently from 84379a3 to c294f39 Compare April 5, 2025 18:05
@Shaggy84675
Copy link
Contributor Author

Thanks!

I left some comments.

Also, could you please add unit tests that test this function against the example file contents we provided?

#49709 (comment) #49709 (comment)

I've added the unit test. Please let me know if any improvements are needed.

I was also thinking if we should add more unit tests if the /proc/stat structure is different. But then I thought that may never happens i guess?

Copy link
Contributor

@robmry robmry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - thank you!

@robmry
Copy link
Contributor

robmry commented Apr 7, 2025

I was also thinking if we should add more unit tests if the /proc/stat structure is different. But then I thought that may never happens i guess?

It wouldn't do any harm to test the error paths - but as it is, the change is already an improvement.

@robmry robmry force-pushed the 49709-fix_system_cpu_usage_stat branch from c294f39 to 302bf7c Compare April 7, 2025 08:27
@robmry
Copy link
Contributor

robmry commented Apr 7, 2025

(Rebased to sort out the validate-vendor test.)

@robmry
Copy link
Contributor

robmry commented Apr 7, 2025

Oh, for the Windows unit tests, daemon\stats_unix_test.go needs //go:build !windows at the top of the file ...

daemon\stats_unix_test.go:16:18: undefined: procStatPath
daemon\stats_unix_test.go:17:2: undefined: procStatPath
daemon\stats_unix_test.go:18:17: undefined: procStatPath

Copy link
Contributor

@vvoland vvoland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And a couple of non-blocking suggestions 😄

Not really blockers though, so can be left for a follow up.

// information.
func getSystemCPUUsage() (cpuUsage uint64, cpuNum uint32, _ error) {
f, err := os.Open("/proc/stat")
f, err := os.Open(procStatPath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could split this into two functions:

func getSystemCPUUsage() (cpuUsage uint64, cpuNum uint32, _ error) {
	f, err := os.Open("/proc/stat")
	if err != nil {
		return 0, 0, err
	}
	defer f.Close()

    return readSystemCPUUsage(f)
}
func readSystemCPUUsage(r io.Reader) (cpuUsage uint64, cpuNum uint32, _ error) {
    rdr := bufio.NewReaderSize(r, 1024)
    ...
}

Comment on lines +12 to +20
dummyFilePath := filepath.Join("testdata", "stat")
expectedCpuUsage := uint64(65647090000000)
expectedCpuNum := uint32(128)

origStatPath := procStatPath
procStatPath = dummyFilePath
defer func() { procStatPath = origStatPath }()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This way we can avoid replacing the file path and inline the testdata in the test:

Suggested change
dummyFilePath := filepath.Join("testdata", "stat")
expectedCpuUsage := uint64(65647090000000)
expectedCpuNum := uint32(128)
origStatPath := procStatPath
procStatPath = dummyFilePath
defer func() { procStatPath = origStatPath }()
input := strings.NewReader(```
<stat content>
```)

_, err := os.Stat(dummyFilePath)
assert.NilError(t, err)

cpuUsage, cpuNum, err := getSystemCPUUsage()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And then test the function that allows passing the file content directly:

Suggested change
cpuUsage, cpuNum, err := getSystemCPUUsage()
cpuUsage, cpuNum, err := readSystemCPUUsage(input)

Comment on lines +25 to +28
assert.Equal(t, cpuUsage, expectedCpuUsage)
assert.Equal(t, cpuNum, expectedCpuNum)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can inline expectedCpuUsage and expectedCpuNum and use cmp assertions:

Suggested change
assert.Equal(t, cpuUsage, expectedCpuUsage)
assert.Equal(t, cpuNum, expectedCpuNum)
assert.Check(t, is.Equal(cpuUsage, uint64(65647090000000)))
assert.Check(t, is.Equal(cpuNum, uint32(128)))

@Shaggy84675 Shaggy84675 force-pushed the 49709-fix_system_cpu_usage_stat branch from fa20c22 to 5c0c8ce Compare April 7, 2025 14:17
This fix address issues where the scanner was unable to properly parse longer outputs from /proc/stat. This could happen on an ARM machine with large amount of CPU cores (and interrupts). By switching to reader we have more control over data parsing and dump unnecessary data

Signed-off-by: Patrik Leifert <[email protected]>
@Shaggy84675 Shaggy84675 force-pushed the 49709-fix_system_cpu_usage_stat branch from 5c0c8ce to e22d04e Compare April 7, 2025 14:19
@Shaggy84675
Copy link
Contributor Author

Oh, for the Windows unit tests, daemon\stats_unix_test.go needs //go:build !windows at the top of the file ...

daemon\stats_unix_test.go:16:18: undefined: procStatPath
daemon\stats_unix_test.go:17:2: undefined: procStatPath
daemon\stats_unix_test.go:18:17: undefined: procStatPath

@robmry @vvoland fixed. I think I've messed up your previous rebase, so you might need to do it again... I'm still fighting with it a bit sometimes 🙈

@thompson-shaun thompson-shaun modified the milestones: 28.0.5, 28.1.0 Apr 7, 2025
@vvoland vvoland moved this from Needs author feedback to Needs maintainer feedback in Issue Triage Apr 7, 2025
@robmry
Copy link
Contributor

robmry commented Apr 7, 2025

We can ignore the failed unit tests - TestEndpointStore is fixed by #49764

Copy link
Contributor

@vvoland vvoland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@vvoland vvoland merged commit 8327848 into moby:master Apr 7, 2025
146 of 148 checks passed
@vvoland vvoland mentioned this pull request Apr 7, 2025
@vvoland vvoland moved this from Needs maintainer feedback to Accepted in Issue Triage Apr 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

No open projects
Status: Todo

Development

Successfully merging this pull request may close these issues.

Docker stats doesn't work on 128 CPUs core system (error scanning '/proc/stat' file: bufio.Scanner: token too long")

4 participants