Under Linux systems where X runs by default on the target GPU, the kernel mode driver will generally be initialized and kept alive from machine startup to shutdown, courtesy of the X process. On headless systems, or in situations where no long-lived X-like client maintains a handle to the target GPU, the kernel mode driver will initialize and deinitialize the target GPU each time a target GPU application starts and stops. In HPC environments this situation is quite common. Since it is often desirable to keep the GPU initialized in these cases, NVIDIA provides two options for changing driver behavior: Persistence Mode (Legacy) and the Persistence Daemon.
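To see which GPUs are currently left without persistence, something like the following could be used. This is only a sketch: the helper name count_disabled is made up for illustration, and the script prints the two options as suggestions rather than running them, since both need root.

```shell
#!/bin/sh
# Hypothetical helper: count GPUs reported as "Disabled" in the CSV from
#   nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader
# Reads that CSV on stdin and prints the count.
count_disabled() {
    # grep -c exits non-zero when the count is 0, so mask the status.
    grep -c 'Disabled' || true
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    disabled=$(nvidia-smi --query-gpu=index,persistence_mode \
        --format=csv,noheader | count_disabled)
    if [ "$disabled" -gt 0 ]; then
        echo "$disabled GPU(s) without persistence mode"
        echo "legacy mode:  sudo nvidia-smi -pm 1"
        echo "daemon:       sudo nvidia-persistenced"
    fi
fi
```

The nvidia-smi output shown later in this post already reports Persistence-M as On for all four GPUs, so on that machine the script would print nothing.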
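Once the driver is installed, the nvidia-smi command becomes available, but it still fails with Failed to initialize NVML: Driver/library version mismatch. This happens because the system has to be rebooted after the graphics driver is reinstalled before everything works properly.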
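Before rebooting, the mismatch can be confirmed by comparing the version of the kernel module that is still loaded with the version of the newly installed user-space NVML library. The sketch below assumes the library lives under /usr/lib/x86_64-linux-gnu, which varies by distribution, and the helper names lib_version and check_match are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: strip everything up to ".so." from a library
# file name, e.g. libnvidia-ml.so.460.32.03 -> 460.32.03
lib_version() {
    printf '%s\n' "${1##*.so.}"
}

# Hypothetical helper: compare a kernel module version with a library version.
check_match() {
    if [ "$1" = "$2" ]; then
        echo "match"
    else
        echo "mismatch: kernel module $1, library $2"
    fi
}

# The version of the loaded module is exposed through sysfs.
if [ -r /sys/module/nvidia/version ]; then
    kernel_ver=$(cat /sys/module/nvidia/version)
    # Assumed library directory; adjust for your distribution.
    for lib in /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.*.*; do
        [ -e "$lib" ] || continue
        check_match "$kernel_ver" "$(lib_version "$lib")"
    done
fi
```

If the two versions differ, the old module is still resident in the kernel, and a reboot loads the one that matches the new library.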
Installing CUDA
However, because CUDA had been uninstalled along with the driver earlier, applications could not run, so CUDA needs to be reinstalled.
wget https://developer.download.nvidia.com/compute/cuda/10.2/Prod/local_installers/cuda_10.2.89_440.33.01_linux.run
sudo sh cuda_10.2.89_440.33.01_linux.run
~/NVIDIA_CUDA-10.1_Samples/0_Simple/vectorAdd$ ./vectorAdd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
nvidia-docker
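Although CUDA programs now work on the host, starting Docker still produces the message no CUDA-capable devices were detected, and CUDA programs still fail inside the container. When nvidia-smi is run there, the CUDA Version field shows N/A.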
Sun May 29 05:59:31 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00005B0A:00:00.0 Off |                  Off |
| N/A   31C    P0    24W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00009E65:00:00.0 Off |                  Off |
| N/A   34C    P0    41W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  On   | 0000B111:00:00.0 Off |                  Off |
| N/A   34C    P0    39W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  On   | 0000BD71:00:00.0 Off |                  Off |
| N/A   32C    P0    40W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Running nvidia-container-cli -k -d /dev/tty info turned up this log entry: could not start driver service: load library failed: libnvidia-fatbinaryloader.so.465.19.01: cannot open shared object file: no such file or directory. Oddly, when I checked, the libcuda for the corresponding driver was installed.
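One thing that can be checked mechanically is whether the version embedded in the library name from that log entry agrees with the driver version that nvidia-smi reports. The following is a rough sketch; the helper name version_in_log is made up for illustration, and the log line is copied from the output above.

```shell
#!/bin/sh
# Hypothetical helper: read log text on stdin and print the version suffix
# of the first shared library name it finds, e.g.
#   libnvidia-fatbinaryloader.so.465.19.01 -> 465.19.01
version_in_log() {
    grep -oE '\.so\.[0-9]+(\.[0-9]+)+' | head -n 1 | sed 's/^\.so\.//'
}

log_line='could not start driver service: load library failed: libnvidia-fatbinaryloader.so.465.19.01: cannot open shared object file: no such file or directory'
wanted=$(printf '%s\n' "$log_line" | version_in_log)
echo "library version requested: $wanted"

# Compare with the installed driver, if nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    installed=$(nvidia-smi --query-gpu=driver_version \
        --format=csv,noheader | head -n 1)
    echo "driver version installed: $installed"
    [ "$wanted" = "$installed" ] || echo "versions differ"
fi
```

With the values quoted in this post (driver 460.32.03 in the nvidia-smi header, 465.19.01 in the log entry) the two do not match, which may be worth looking into first.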