Skip to content
This repository was archived by the owner on Oct 24, 2023. It is now read-only.

fix: make sure nvidia utility and compute bins can be mounted into co…#3029

Merged
jackfrancis merged 2 commits intoAzure:masterfrom
delulu:nvidia
Apr 10, 2020
Merged

fix: make sure nvidia utility and compute bins can be mounted into co…#3029
jackfrancis merged 2 commits intoAzure:masterfrom
delulu:nvidia

Conversation

@delulu
Copy link
Copy Markdown
Contributor

@delulu delulu commented Apr 6, 2020

Reason for Change:

Make sure nvidia utility and compute bins can be mounted into container as documented in nvidia-container-runtime readme.

Issue Fixed:

NVIDIA utility/compute bins are not mounted into containers when they're required in container env variable as NVIDIA_DRIVER_CAPABILITIES=compute,utility

Requirements:

Notes:

@welcome
Copy link
Copy Markdown

welcome bot commented Apr 6, 2020

💖 Thanks for opening your first pull request! 💖 We use semantic commit messages to streamline the release process. Before your pull request can be merged, you should make sure your first commit and PR title start with a semantic prefix. Examples of commit messages with semantic prefixes: - fix: change azure disk cachingMode to ReadOnly - feat: make maximumLoadBalancerRuleCount configurable - docs: add note on AKS Engine and AKS relationship
Make sure to check out the developer guide for guidance on testing your change.

@delulu
Copy link
Copy Markdown
Contributor Author

delulu commented Apr 6, 2020

more details can be found from #2837

@mboersma
Copy link
Copy Markdown
Member

mboersma commented Apr 6, 2020

/azp run pr-e2e

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@mboersma
Copy link
Copy Markdown
Member

mboersma commented Apr 6, 2020

@delulu could you run make generate and then commit and push the code that creates? You can make sure your PR will pass this check by running make test-style test locally.

retrycmd_if_failure 120 5 25 mount -t overlay -o lowerdir=/usr/lib/x86_64-linux-gnu,upperdir=${GPU_DEST}/lib64,workdir=${GPU_DEST}/overlay-workdir none /usr/lib/x86_64-linux-gnu || exit {{GetCSEErrorCode "ERR_GPU_DRIVERS_CONFIG"}}
retrycmd_if_failure 3 1 600 sh $GPU_DEST/nvidia-drivers-$GPU_DV --silent --accept-license --no-drm --dkms --utility-prefix="${GPU_DEST}" --opengl-prefix="${GPU_DEST}" || exit {{GetCSEErrorCode "ERR_GPU_DRIVERS_START_FAIL"}}
echo "${GPU_DEST}/lib64" >/etc/ld.so.conf.d/nvidia.conf
cp "${GPU_DEST}/bin" /usr/bin
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is special about /usr/bin as opposed to /usr/local/nvidia/bin?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvidia-container-runtime and nvidia-container-cli bins are located in /usr/bin.

It seems it will only search nvidia utility/compute bins in the same folder and then mounted them accordingly into container. Because I've tried to add /usr/local/nvidia into system path, and it does not work as expected.

I also opened an issue NVIDIA/nvidia-docker#1226, but there's no response yet.

So this's a quick fix for this issue.

@delulu
Copy link
Copy Markdown
Contributor Author

delulu commented Apr 7, 2020

@delulu could you run make generate and then commit and push the code that creates? You can make sure your PR will pass this check by running make test-style test locally.

I tried, but there's some unexpected issue as shown below, maybe some more settings are required to setup local box. So could you help run it at your side, thx!

delzh@MININT-DELU:/mnt/e/git/aks-engine$ make test-style test

==> Running Go linter <==
golangci-lint has version 1.23.7 built from 96964db on 2020-02-28T12:07:44Z
==> Running shell linter <==
ShellCheck - shell script analysis tool
version: 0.3.7
license: GNU Affero General Public License, version 3
website: http://www.shellcheck.net

In ./parts/k8s/cloud-init/artifacts/cse_customcloud.sh line 63:
      jq --arg K8S_CLIENT_CERT_PATH ${K8S_CLIENT_CERT_PATH} '. + {aadClientCertPath:($K8S_CLIENT_CERT_PATH)}' |
                                                            ^-- SC2016: Expressions don't expand in single quotes, use double quotes for that.
...
Makefile:79: recipe for target 'validate-shell' failed
make: *** [validate-shell] Error 1

@msftclas
Copy link
Copy Markdown

msftclas commented Apr 7, 2020

CLA assistant check
All CLA requirements met.

@mboersma
Copy link
Copy Markdown
Member

mboersma commented Apr 7, 2020

@delulu no problem, I did make generate and pushed a commit.

I think the make test-style failure was because we assume shellcheck 0.7.0 but you have version 0.3.7. You can do make dev to run commands from our toolbox container, which has that version, but we don't have a way to enforce tool versions on workstations in general.

@jackfrancis
Copy link
Copy Markdown
Member

/azp run pr-e2e

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

retrycmd_if_failure 120 5 25 mount -t overlay -o lowerdir=/usr/lib/x86_64-linux-gnu,upperdir=${GPU_DEST}/lib64,workdir=${GPU_DEST}/overlay-workdir none /usr/lib/x86_64-linux-gnu || exit {{GetCSEErrorCode "ERR_GPU_DRIVERS_CONFIG"}}
retrycmd_if_failure 3 1 600 sh $GPU_DEST/nvidia-drivers-$GPU_DV --silent --accept-license --no-drm --dkms --utility-prefix="${GPU_DEST}" --opengl-prefix="${GPU_DEST}" || exit {{GetCSEErrorCode "ERR_GPU_DRIVERS_START_FAIL"}}
echo "${GPU_DEST}/lib64" >/etc/ld.so.conf.d/nvidia.conf
cp "${GPU_DEST}/bin/*" /usr/bin
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be creating a second copy of the file, should we create a symlink at /usr/bin/nvidia-smi to /usr/local/nvidia/bin/nvidia-smi instead?

Copy link
Copy Markdown
Contributor Author

@delulu delulu Apr 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually we can mv "${GPU_DEST}/bin/*" /usr/bin if you has this concern, since utility libraries have been added to system library path with echo "${GPU_DEST}/lib64" >/etc/ld.so.conf.d/nvidia.conf.

It will work with a symlink, but there're also other utility bins and compute bins in ${GPU_DEST}/bin/*, they also might be required in gpu containers, and I would like to keep the change small to not introduce any potential issue.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mv works for me, thanks!

@codecov
Copy link
Copy Markdown

codecov bot commented Apr 8, 2020

Codecov Report

Merging #3029 into master will not change coverage by %.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #3029   +/-   ##
=======================================
  Coverage   70.63%   70.63%           
=======================================
  Files         145      145           
  Lines       25151    25151           
=======================================
  Hits        17765    17765           
  Misses       6283     6283           
  Partials     1103     1103           
Impacted Files Coverage Δ
pkg/engine/templates_generated.go 31.57% <ø> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5218a91...ec926bb. Read the comment docs.

@delulu
Copy link
Copy Markdown
Contributor Author

delulu commented Apr 10, 2020

updated per comments, please have a further check.

@delulu
Copy link
Copy Markdown
Contributor Author

delulu commented Apr 10, 2020

/azp run pr-e2e

@azure-pipelines
Copy link
Copy Markdown

Commenter does not have sufficient privileges for PR 3029 in repo Azure/aks-engine

@mboersma
Copy link
Copy Markdown
Member

/azp run pr-e2e

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Copy Markdown
Member

@jackfrancis jackfrancis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@jackfrancis
Copy link
Copy Markdown
Member

@yangl900 @andyliuliming FYI

@jackfrancis jackfrancis merged commit 8fb97a2 into Azure:master Apr 10, 2020
@welcome
Copy link
Copy Markdown

welcome bot commented Apr 10, 2020

Congrats on merging your first pull request! 🎉🎉🎉

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants