# Troubleshooting

Some of the problems you could run into when using the agent, along with their solutions.
## Failed to initialize NVML: Unknown Error
This error applies to any utility or library that uses NVML: `pytorch`, `nvidia-smi`, `pynvml`, etc. It frequently shows up unexpectedly and prevents applications from using the GPU until it is fixed.
### Quick solution
If the error only appears inside the container, you can quickly fix it by restarting the container. If the error also appears on the host after running `nvidia-smi`, you can fix it by running the `reboot` command on the host.
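For example (the container name `supervisely-agent` below is only an assumption; substitute the actual name or ID of your agent container):

```bash
# Error appears only inside the container: restart it
docker restart supervisely-agent

# Error also appears on the host after running nvidia-smi: reboot the host
sudo reboot
```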
### Proper solution
If the error has not yet appeared, you can check whether your system is affected by this problem:

1. Run the agent's Docker container on the host PC.
2. Run `sudo systemctl daemon-reload` on the host.
3. Execute `nvidia-smi` inside the agent container and check whether it returns the NVML initialization error.
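One way to perform this check from the host terminal (the container name `supervisely-agent` is an assumption; replace it with your own):

```bash
# Reload the systemd manager configuration on the host
sudo systemctl daemon-reload

# Run nvidia-smi inside the agent container and watch for
# "Failed to initialize NVML: Unknown Error"
docker exec supervisely-agent nvidia-smi
```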
If the error is reproduced, apply the fix:

1. Set the parameter `"exec-opts": ["native.cgroupdriver=cgroupfs"]` in the `/etc/docker/daemon.json` file.
2. Restart Docker with `sudo systemctl restart docker`.
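For reference, `/etc/docker/daemon.json` could look something like this after the change (the `runtimes` block is only an example of a typical NVIDIA Container Toolkit setup; keep whatever other options you already have in the file):

```json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
```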
You can also try other official NVIDIA fixes to solve this problem for a specific Docker container, or dig deeper into the issue by reading this discussion or this official description.
## CUDA Out Of Memory Error
This error can appear in any training app.
### Solution
1. Check the amount of free GPU memory by running the `nvidia-smi` command in your machine's terminal. It will give you an understanding of how much GPU memory needs to be freed in order to train your machine learning model.
2. Stop unnecessary app sessions in Supervisely: START button → App Sessions → stop all unnecessary app sessions by clicking the Stop button in front of every undesired app session.
3. Stop unnecessary processes in your machine's terminal by running `sudo kill <put_your_process_id_here>` (see the example after this list).
4. Select a lighter machine learning model (check the "Memory" column in the model table: it shows how much GPU memory the model requires for training). If this information is not provided, use a simple rule: the higher the model is in the table, the lighter it is.
5. Reduce the batch size or the model input resolution.
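For instance, a typical terminal session could look like this (the process ID below is just a placeholder):

```bash
# Show GPU memory usage and the processes currently occupying the GPU
nvidia-smi

# Suppose nvidia-smi reports that PID 12345 holds most of the GPU memory
# and it is not needed; stop it
sudo kill 12345
```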
**Additional: stop a process via Docker.**
1. Run `docker ps`. It will return a table with all Docker containers running on this machine.
2. Run `docker stop <put_your_container_id_here>`.
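For example (the container ID below is a placeholder; take the real one from the `docker ps` output):

```bash
# List running containers with their IDs, images and names
docker ps

# Stop the container that runs the unneeded session
docker stop a1b2c3d4e5f6
```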
## Can't start the docker container. Trying to use another runtime.
This message indicates that there was a problem using the Nvidia runtime. Most likely, the Nvidia driver failed after an automatic kernel update, a feature that is enabled by default on Ubuntu. Such updates often fail because the driver cannot be unloaded while it is in use.
### Solution
#### Fast solution
The simplest way to fix this problem is to `reboot` the machine. After the reboot, the Nvidia driver will be reloaded and the problem will be fixed. If you don't want to reboot the machine, use the second solution.
#### Without rebooting
If you receive an Nvidia version mismatch error after executing the `nvidia-smi` command, you can fix it without rebooting or reinstalling the driver using these commands:
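The exact commands can differ from system to system; a typical sequence matching the description below looks like this:

```bash
# Kill every process that currently has the Nvidia devices open
sudo fuser -vk /dev/nvidia*

# Unload the Nvidia kernel module (dependent modules such as nvidia_uvm,
# nvidia_drm or nvidia_modeset may need to be removed first on some systems)
sudo rmmod nvidia

# Load the Nvidia kernel module again
sudo modprobe nvidia
```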
The first command kills all processes that use the Nvidia driver, and the last two unload and reload the driver. You might need to redeploy your agent on the machine after running these commands.
**Additional: disable automatic kernel updates.**
You can also run this command to disable automatic kernel updates:
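On Ubuntu this is typically done by putting the kernel packages on hold (the package names below are an assumption and may differ depending on your kernel flavor):

```bash
# Prevent apt from automatically upgrading the kernel packages
sudo apt-mark hold linux-image-generic linux-headers-generic
```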
If the commands above don't work for you (some process keeps restarting automatically and prevents the driver from unloading properly), you can simply reboot the machine.