A command-line diagnostic tool for GPU health monitoring and troubleshooting. This tool helps identify and diagnose common GPU issues, including memory leaks, hardware failures, and performance degradation.
- Real-time GPU health monitoring
- Memory leak detection
- Hardware failure diagnosis
- Performance metrics analysis
- Mock testing capabilities for development
Run tests in a Docker container:
make test
Run tests locally and generate a coverage report:
make test-local
If you are developing on macOS, consider compiling inside a Docker container. Taking the ubuntu:22.04 image as an example, install the following dependencies in the container and mount the project into it for compilation.
- Start the container
docker run --platform=linux/amd64 -itd -v ./ai-accelerator-tool:/git/src/github.com/aibrix/ai-accelerator-tool/ ubuntu:22.04
- Install dependencies in the container
apt update
apt install -y vim cmake clang libnvidia-ml-dev git wget
wget https://go.dev/dl/go1.23.2.linux-amd64.tar.gz
tar xvf go1.23.2.linux-amd64.tar.gz
echo "export PATH=$PATH:/go/bin" >> ~/.bashrc
source ~/.bashrc
- Compile the project in the container
cd /git/src/github.com/aibrix/ai-accelerator-tool
git submodule update --init --recursive
make lib-injection
cp lib/build/lib/libdevso-injection.so pkg/mock/resources/injectiond.so
GOOS=linux GOARCH=amd64 make build
The binary will be generated in bin/.
# Set the number of GPU cards in the machine, for example, 4.
export GPU_CARD_COUNT=4
# Run the diagnosis.
ai-accelerator-tool diagnose
Note:
- This tool requires the nvidia-smi command to be installed.
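As a minimal sketch of a preflight step, the snippet below checks that nvidia-smi is on the PATH and derives the card count from the output of `nvidia-smi -L` instead of hard-coding it. The helper names require_cmd and count_gpus are hypothetical and not part of this tool. The sketch also assumes GPU_CARD_COUNT should equal the number of physical GPUs listed.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the named command is on PATH.
require_cmd() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "missing required command: $1" >&2
        return 1
    fi
}

# Hypothetical helper: count top-level "GPU N: ..." lines read from stdin,
# the format printed by `nvidia-smi -L`. Indented MIG lines are ignored.
count_gpus() {
    grep -c '^GPU [0-9]' || true
}

# Assumed usage on a GPU host:
# require_cmd nvidia-smi || exit 1
# GPU_CARD_COUNT="$(nvidia-smi -L | count_gpus)"
# export GPU_CARD_COUNT
# ai-accelerator-tool diagnose
```

Reading the listing from stdin keeps count_gpus usable on a saved nvidia-smi listing as well as on live output.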
You can refer to the comments in hack/gpu_mock_conf.toml to configure the fault scenario.
ai-accelerator-tool mock --config /PATH/TO/gpu_mock_conf.toml
mkdir -p /opt/gpu_mock && cd /opt/gpu_mock/
cp /PATH/TO/nvml_injectiond.so /opt/gpu_mock/
cp /PATH/TO/gpu_mock_conf.toml /opt/gpu_mock/
echo "/opt/gpu_mock/nvml_injectiond.so" >> /etc/ld.so.preload
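Because the echo command above appends a new line every time it runs, repeated setup leaves duplicate entries in /etc/ld.so.preload. The following sketch shows one way to make the append idempotent and to remove the entry again when the mock is no longer wanted. The helper names add_preload and remove_preload are hypothetical. The second argument exists only so the helpers can be tried on a scratch file, and defaults to /etc/ld.so.preload.

```shell
#!/bin/sh
# Hypothetical helper: append a library path to a preload file
# only if that exact line is not already present.
add_preload() {
    lib="$1"
    preload_file="${2:-/etc/ld.so.preload}"
    # -x matches whole lines, -F treats the path as a literal string.
    if ! grep -qxF "$lib" "$preload_file" 2>/dev/null; then
        echo "$lib" >> "$preload_file"
    fi
}

# Hypothetical helper: drop the library path from the preload file,
# leaving every other entry untouched.
remove_preload() {
    lib="$1"
    preload_file="${2:-/etc/ld.so.preload}"
    [ -f "$preload_file" ] || return 0
    # grep -v exits non-zero when nothing is left to print; that is not an error here.
    grep -vxF "$lib" "$preload_file" > "$preload_file.tmp" || true
    # Write back in place so the file keeps its owner and permissions.
    cat "$preload_file.tmp" > "$preload_file"
    rm -f "$preload_file.tmp"
}

# Assumed usage (as root):
# add_preload /opt/gpu_mock/nvml_injectiond.so
# remove_preload /opt/gpu_mock/nvml_injectiond.so
```

Entries in /etc/ld.so.preload apply to every dynamically linked process on the host, so removing the line once mock testing is finished avoids affecting unrelated programs.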