Cuda failure: CUDA driver version is insufficient for CUDA runtime version

Description

As the title says, I am getting this error when starting TensorRT and can't seem to get rid of it, even though everything is running on the latest available versions.

Environment

TensorRT Version: 7.1.3.4
GPU Type: RTX2070
Nvidia Driver Version: 450.57
CUDA Version: 11.0.2
CUDNN Version: 8.0.1
Operating System + Version: Ubuntu 18.04
Python Version (if applicable): 3.7.5
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.5.1
Baremetal or Container (if container which image + tag): VM

Relevant Files

Steps To Reproduce

Using the Python script, trying to convert an ONNX file to TensorRT:
trtexec --onnx=yolov4_5_3_608_608.onnx --workspace=4096 --saveEngine=yolov4-5 --fp16 --explicitBatch
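
For reference, the equivalent conversion through the TensorRT Python API looks roughly like this (a minimal sketch of the standard TensorRT 7 ONNX-parser workflow, with file names taken from the command above):

    import tensorrt as trt

    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
    # Explicit-batch network, matching trtexec's --explicitBatch flag
    EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

    def build_engine(onnx_path="yolov4_5_3_608_608.onnx", engine_path="yolov4-5"):
        with trt.Builder(TRT_LOGGER) as builder, \
             builder.create_network(EXPLICIT_BATCH) as network, \
             builder.create_builder_config() as config, \
             trt.OnnxParser(network, TRT_LOGGER) as parser:
            config.max_workspace_size = 4096 << 20   # --workspace=4096 (MiB)
            config.set_flag(trt.BuilderFlag.FP16)    # --fp16
            with open(onnx_path, "rb") as f:
                if not parser.parse(f.read()):
                    for i in range(parser.num_errors):
                        print(parser.get_error(i))
                    return None
            engine = builder.build_engine(network, config)
            if engine is not None:
                with open(engine_path, "wb") as f:
                    f.write(engine.serialize())
            return engine

    build_engine()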

Note that I also tested with the driver version included in the CUDA toolkit package, with the same outcome. I am also running Caffe and TensorRT models through OpenCV on this same cuDNN/CUDA/driver set without this problem.

Hi @anhmantran,
I tried reproducing your issue with a YOLOv4 model, and it worked fine for me.
Can you share your ONNX model so that I can try with it?
Also, please check whether you are using compatible versions of CUDA, per the link below.

Thanks!

The ONNX file is the public YOLOv4 model. trtexec crashes before it even looks for the file, so I don't think the model has anything to do with it; I get the same error with any random file name I put in the command.
I also just upgraded cuDNN to 8.0.2 GA and the problem remains. trtexec just refuses to do anything. I also uninstalled and reinstalled TensorRT and the problem is still there. It just doesn't appear to read the CUDA driver version correctly. Is there any way to bypass that check, or at least to figure out where it is getting its CUDA or driver version from?

So OpenCV, PyTorch and TensorFlow all see the driver correctly, and only TensorRT has this problem…
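
One way to see which driver and runtime versions a process is actually picking up is to query them directly with ctypes against whichever libcuda.so.1 / libcudart.so the loader resolves (a rough sketch, nothing TensorRT-specific; it assumes both libraries are on the loader path):

    import ctypes

    # Driver API version, from whichever libcuda.so.1 the dynamic linker resolves
    libcuda = ctypes.CDLL("libcuda.so.1")
    version = ctypes.c_int(0)
    print("cuInit:", libcuda.cuInit(0))              # 0 means CUDA_SUCCESS
    libcuda.cuDriverGetVersion(ctypes.byref(version))
    print("driver API version:", version.value)      # e.g. 11000 for CUDA 11.0

    # Runtime API version, from the toolkit's libcudart.so
    libcudart = ctypes.CDLL("libcudart.so")
    libcudart.cudaRuntimeGetVersion(ctypes.byref(version))
    print("runtime version:", version.value)

If the driver API version printed here is lower than the runtime version, that would match the "driver version is insufficient" message.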

Hi @anhmantran,

Can you please check against the installation guide for any missing steps?
However, to avoid any system-related dependency issues, we recommend using NGC containers.

Thanks!

Thank you for your response. I did follow the installation steps very carefully and, as I said, I even tested with the 450.51.5 driver that came embedded in the CUDA toolkit. All my other frameworks work, even using the RT cores, and only TensorRT is problematic. And unfortunately, I am one of the people who hate containers with a passion because of the added layer of management and complication they involve, so no, that's not an option. I will just be using other frameworks for now.

Quick update on this: I tried starting the Python bindings for TensorRT and am getting a similar error 35.
This is getting really frustrating, and please do not tell me it is a cuDNN, driver or CUDA version mismatch.
CUDA and cuDNN have been working fine, and my GPU is running 7 different inferences on other frameworks using CUDA.
I suspect that TensorRT is somehow checking in the wrong place; I just don't know where it is looking or what it is looking for. It is possible that some remnants of previous driver installations are still there, though this is a pretty new installation and I have never had any version of CUDA below 11 or a driver below 450.5x.
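
For reference, error 35 in the CUDA runtime's error enum is cudaErrorInsufficientDriver, i.e. the same "driver version is insufficient" condition as in the title. A quick way to confirm (a small ctypes sketch, assuming libcudart.so is resolvable):

    import ctypes

    # Ask the runtime library itself what error code 35 means
    libcudart = ctypes.CDLL("libcudart.so")
    libcudart.cudaGetErrorName.restype = ctypes.c_char_p
    libcudart.cudaGetErrorString.restype = ctypes.c_char_p
    print(libcudart.cudaGetErrorName(35).decode())    # cudaErrorInsufficientDriver
    print(libcudart.cudaGetErrorString(35).decode())  # "CUDA driver version is insufficient ..."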

I have gone through and reinstalled the driver, uninstalled the CUDA 11 toolkit, and reinstalled the CUDA toolkit from the .run installer, making sure it was first uninstalled from apt/deb and following all the documentation.
nvidia-smi shows:
| NVIDIA-SMI 450.57 Driver Version: 450.57 CUDA Version: 11.0 |

lspci shows:
00:06.0 VGA compatible controller: NVIDIA Corporation Device 1f02 (rev a1)

lsmod | grep nvidia shows:
nvidia_uvm 974848 4
nvidia_drm 49152 0
nvidia_modeset 1179648 1 nvidia_drm
nvidia 19632128 1448 nvidia_uvm,nvidia_modeset
drm_kms_helper 172032 1 nvidia_drm
drm 397312 3 drm_kms_helper,nvidia_drm

I get no output from lsmod | grep nouveau.

I have verified that
/dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia-uvm-tools
all exist.

nvcc -V outputs:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Thu_Jun_11_22:26:38_PDT_2020
Cuda compilation tools, release 11.0, V11.0.194
Build cuda_11.0_bu.TC445_37.28540450_0

I find only one nvidia.ko on my system, located here:
/lib/modules/4.15.0-112-lowlatency/updates/dkms/nvidia.ko

running modinfo on it gives this:

filename: /lib/modules/4.15.0-112-lowlatency/updates/dkms/nvidia.ko
alias: char-major-195-*
version: 450.57
supported: external
license: NVIDIA
srcversion: CCB2AEF641D4CD7A82E48B3
alias: pci:v000010DEdsvsdbc03sc02i00
alias: pci:v000010DEdsvsdbc03sc00i00
depends:
retpoline: Y
name: nvidia
vermagic: 4.15.0-112-lowlatency SMP preempt mod_unload
parm: NvSwitchRegDwords:NvSwitch regkey (charp)
parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid…] (charp)
parm: nv_cap_enable_devfs:nv_cap_enable_devfs=0 or 1 (int)
parm: NVreg_ResmanDebugLevel:int
parm: NVreg_RmLogonRC:int
parm: NVreg_ModifyDeviceFiles:int
parm: NVreg_DeviceFileUID:int
parm: NVreg_DeviceFileGID:int
parm: NVreg_DeviceFileMode:int
parm: NVreg_InitializeSystemMemoryAllocations:int
parm: NVreg_UsePageAttributeTable:int
parm: NVreg_MapRegistersEarly:int
parm: NVreg_RegisterForACPIEvents:int
parm: NVreg_EnablePCIeGen3:int
parm: NVreg_EnableMSI:int
parm: NVreg_TCEBypassMode:int
parm: NVreg_EnableStreamMemOPs:int
parm: NVreg_EnableBacklightHandler:int
parm: NVreg_RestrictProfilingToAdminUsers:int
parm: NVreg_PreserveVideoMemoryAllocations:int
parm: NVreg_DynamicPowerManagement:int
parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
parm: NVreg_EnableUserNUMAManagement:int
parm: NVreg_MemoryPoolSize:int
parm: NVreg_KMallocHeapMaxSize:int
parm: NVreg_VMallocHeapMaxSize:int
parm: NVreg_IgnoreMMIOCheck:int
parm: NVreg_NvLinkDisable:int
parm: NVreg_EnablePCIERelaxedOrderingMode:int
parm: NVreg_RegisterPCIDriver:int
parm: NVreg_RegistryDwords:charp
parm: NVreg_RegistryDwordsPerDevice:charp
parm: NVreg_RmMsg:charp
parm: NVreg_GpuBlacklist:charp
parm: NVreg_TemporaryFilePath:charp
parm: NVreg_AssignGpus:charp

What am I missing?
The TensorRT samples produce the same error:
[TRT] CUDA initialization failure with error 35.
Meanwhile, OpenCV, FFmpeg and PyTorch have been running fine with CUDA/cuDNN enabled.
I researched this a bit and it seems to be a long-standing error across a number of versions, with some inconsistency between how the version is checked and how CUDA is invoked. I am sure I have some installation issue, but I can't point to any.
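
Since nvidia-smi, nvcc and the kernel module all report consistent versions, one more thing worth checking is which libcuda.so file actually gets mapped into a process at runtime, for example by loading it and reading /proc/self/maps (a rough sketch; the paths will differ per system):

    import ctypes

    # Load the driver library the way a CUDA application would, then see which
    # file on disk actually got mapped into this process.
    for name in ("libcuda.so", "libcuda.so.1"):
        try:
            ctypes.CDLL(name)
        except OSError as err:
            print(name, "->", err)

    with open("/proc/self/maps") as maps:
        mapped = {line.split()[-1] for line in maps if "libcuda" in line}
    for path in sorted(mapped):
        print(path)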

Answering my own problem which I seem to have resolved:

Running this command to find my CUDA libraries:

ldconfig -p | grep libcuda

got me this:

    libcudart.so.11.0 (libc6,x86-64) => /usr/local/cuda/lib64/libcudart.so.11.0
    libcudart.so (libc6,x86-64) => /usr/local/cuda/lib64/libcudart.so
    libcuda.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so.1
    libcuda.so.1 (libc6) => /usr/lib32/libcuda.so.1
    libcuda.so (libc6,x86-64) => /usr/local/cuda/lib64/libcuda.so
    libcuda.so (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so
    libcuda.so (libc6) => /usr/lib32/libcuda.so

This then led me to look at the library links in the CUDA folder and compare the actual library files:

$ ls -l /usr/lib/x86_64-linux-gnu/libcuda*
lrwxrwxrwx 1 root root       12 Jul 31 13:33 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root       17 Jul 31 13:33 /usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.450.57
-rwxr-xr-x 1 root root 19034984 Jul 31 13:33 /usr/lib/x86_64-linux-gnu/libcuda.so.450.57

$ ls -l /usr/local/cuda/lib64/libcuda*
-rw-r--r-- 1 root root 795844 Jul 31 13:31 /usr/local/cuda/lib64/libcudadevrt.a
lrwxrwxrwx 1 root root     17 Jul 31 13:31 /usr/local/cuda/lib64/libcudart.so -> libcudart.so.11.0
lrwxrwxrwx 1 root root     21 Jul 31 13:30 /usr/local/cuda/lib64/libcudart.so.11.0 -> libcudart.so.11.0.194
-rwxr-xr-x 1 root root 514936 Jul 31 13:30 /usr/local/cuda/lib64/libcudart.so.11.0.194
-rw-r--r-- 1 root root 931088 Jul 31 13:31 /usr/local/cuda/lib64/libcudart_static.a
lrwxrwxrwx 1 root root     38 Jun 25 15:59 /usr/local/cuda/lib64/libcuda.so -> /usr/local/cuda/lib64/stubs/libcuda.so

$ ls -l /usr/local/cuda/lib64/stubs/libcuda.so
-rwxr-xr-x 1 root root 48992 Jul 31 13:31 /usr/local/cuda/lib64/stubs/libcuda.so

As you can see, the libcuda.so symlink somehow points to the stub library inside the CUDA installation and not to the driver library. I don't know how it got that way, but I deleted the symbolic link and created a new one pointing to the driver library:

$ sudo rm /usr/local/cuda/lib64/libcuda.so
$ sudo ln -s /usr/lib/x86_64-linux-gnu/libcuda.so /usr/local/cuda/lib64/libcuda.so

And this appears to have resolved the problem…
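
A quick way to confirm the fix from the Python side is to trigger CUDA initialization through the TensorRT bindings, which is where the error 35 used to show up (a small sketch; any builder call that touches the GPU should do):

    import tensorrt as trt

    # Creating a builder initializes CUDA; with the bad stub symlink in place,
    # this is where "CUDA initialization failure with error 35" appeared.
    logger = trt.Logger(trt.Logger.INFO)
    builder = trt.Builder(logger)
    print("FP16 support:", builder.platform_has_fast_fp16)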
