fix: make sure nvidia utility and compute bins can be mounted into co…#3029
jackfrancis merged 2 commits into Azure:master
|
💖 Thanks for opening your first pull request! 💖 We use semantic commit messages to streamline the release process. Before your pull request can be merged, you should make sure your first commit and PR title start with a semantic prefix. Examples of commit messages with semantic prefixes: - |
|
more details can be found from #2837 |
|
/azp run pr-e2e |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@delulu could you run |
retrycmd_if_failure 120 5 25 mount -t overlay -o lowerdir=/usr/lib/x86_64-linux-gnu,upperdir=${GPU_DEST}/lib64,workdir=${GPU_DEST}/overlay-workdir none /usr/lib/x86_64-linux-gnu || exit {{GetCSEErrorCode "ERR_GPU_DRIVERS_CONFIG"}}
retrycmd_if_failure 3 1 600 sh $GPU_DEST/nvidia-drivers-$GPU_DV --silent --accept-license --no-drm --dkms --utility-prefix="${GPU_DEST}" --opengl-prefix="${GPU_DEST}" || exit {{GetCSEErrorCode "ERR_GPU_DRIVERS_START_FAIL"}}
echo "${GPU_DEST}/lib64" >/etc/ld.so.conf.d/nvidia.conf
cp "${GPU_DEST}/bin" /usr/bin
What is special about /usr/bin as opposed to /usr/local/nvidia/bin?
nvidia-container-runtime and nvidia-container-cli bins are located in /usr/bin.
It seems the runtime only searches for the nvidia utility/compute bins in that same folder and then mounts them into the container accordingly. I tried adding /usr/local/nvidia to the system path instead, and it did not work as expected.
I also opened an issue NVIDIA/nvidia-docker#1226, but there's no response yet.
So this is a quick fix for the issue.
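The check described above can be sketched in shell. This is only an illustration under the assumption stated in the comment (that the runtime looks for driver binaries in the directory holding nvidia-container-cli); `missing_bins` is a hypothetical helper, not part of aks-engine or any NVIDIA tool.

```shell
# Hypothetical helper: print each named binary that is NOT present
# (as an executable file) in the given directory.
missing_bins() {
  dir="$1"
  shift
  for name in "$@"; do
    [ -x "${dir}/${name}" ] || echo "${name}"
  done
}

# Example call (assumes nvidia-container-cli is on PATH; not run here):
#   missing_bins "$(dirname "$(command -v nvidia-container-cli)")" nvidia-smi
```

An empty result would mean every listed binary sits next to nvidia-container-cli; any printed name is a candidate for the "not mounted into the container" symptom.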
I tried, but hit an unexpected issue, shown below; my local box may need some additional setup. Could you run it on your side? Thanks!
delzh@MININT-DELU:/mnt/e/git/aks-engine$ make test-style test
==> Running Go linter <==
golangci-lint has version 1.23.7 built from 96964db on 2020-02-28T12:07:44Z
==> Running shell linter <==
ShellCheck - shell script analysis tool
version: 0.3.7
license: GNU Affero General Public License, version 3
website: http://www.shellcheck.net
In ./parts/k8s/cloud-init/artifacts/cse_customcloud.sh line 63:
jq --arg K8S_CLIENT_CERT_PATH ${K8S_CLIENT_CERT_PATH} '. + {aadClientCertPath:($K8S_CLIENT_CERT_PATH)}' |
^-- SC2016: Expressions don't expand in single quotes, use double quotes for that.
...
Makefile:79: recipe for target 'validate-shell' failed
make: *** [validate-shell] Error 1 |
|
@delulu no problem, I did I think the |
|
/azp run pr-e2e |
|
Azure Pipelines successfully started running 1 pipeline(s). |
retrycmd_if_failure 120 5 25 mount -t overlay -o lowerdir=/usr/lib/x86_64-linux-gnu,upperdir=${GPU_DEST}/lib64,workdir=${GPU_DEST}/overlay-workdir none /usr/lib/x86_64-linux-gnu || exit {{GetCSEErrorCode "ERR_GPU_DRIVERS_CONFIG"}}
retrycmd_if_failure 3 1 600 sh $GPU_DEST/nvidia-drivers-$GPU_DV --silent --accept-license --no-drm --dkms --utility-prefix="${GPU_DEST}" --opengl-prefix="${GPU_DEST}" || exit {{GetCSEErrorCode "ERR_GPU_DRIVERS_START_FAIL"}}
echo "${GPU_DEST}/lib64" >/etc/ld.so.conf.d/nvidia.conf
cp "${GPU_DEST}/bin/*" /usr/bin
This seems to create a second copy of the file. Should we create a symlink at /usr/bin/nvidia-smi pointing to /usr/local/nvidia/bin/nvidia-smi instead?
If that is a concern, we can use mv "${GPU_DEST}/bin/*" /usr/bin instead, since the utility libraries have already been added to the system library path with echo "${GPU_DEST}/lib64" >/etc/ld.so.conf.d/nvidia.conf.
A symlink would work, but ${GPU_DEST}/bin/* also contains other utility and compute bins that might be required in GPU containers, and I would like to keep the change small to avoid introducing any new issues.
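For comparison, the symlink alternative discussed above could cover every file in the directory rather than nvidia-smi alone. The following is a minimal sketch, not the change merged in this PR; `link_bins` is a hypothetical helper name, and whether the NVIDIA runtime resolves symlinked binaries is taken from the comment above rather than verified here.

```shell
# Hypothetical helper: symlink every regular file in SRC_DIR into DEST_DIR.
link_bins() {
  src_dir="$1"
  dest_dir="$2"
  # The glob sits outside the quotes so the shell expands it;
  # a quoted "${src_dir}/*" would be passed on as a literal path.
  for bin in "${src_dir}"/*; do
    [ -f "${bin}" ] || continue   # skip if the glob matched nothing
    ln -sf "${bin}" "${dest_dir}/$(basename "${bin}")"
  done
}

# Example call (needs write access to /usr/bin; not run here):
#   link_bins "${GPU_DEST}/bin" /usr/bin
```

This avoids a second copy of each binary, at the cost of /usr/bin depending on ${GPU_DEST}/bin staying in place.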
Codecov Report
@@ Coverage Diff @@
## master #3029 +/- ##
=======================================
Coverage 70.63% 70.63%
=======================================
Files 145 145
Lines 25151 25151
=======================================
Hits 17765 17765
Misses 6283 6283
Partials 1103 1103
Continue to review full report at Codecov.
|
|
Updated per comments; please take another look. |
|
/azp run pr-e2e |
|
Commenter does not have sufficient privileges for PR 3029 in repo Azure/aks-engine |
|
/azp run pr-e2e |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Congrats on merging your first pull request! 🎉🎉🎉 |
Reason for Change:
Make sure nvidia utility and compute bins can be mounted into containers, as documented in the nvidia-container-runtime README.
Issue Fixed:
NVIDIA utility/compute bins are not mounted into containers when they are requested through the container environment variable
NVIDIA_DRIVER_CAPABILITIES=compute,utility
Requirements:
Notes: