Cannot solve Error 101: invalid device ordinal

We have one server with five older GPUs (5 × NVIDIA Titan Xp).
Executing x.to("cuda") raises the following error, even though torch.cuda.device_count() correctly returns 5:
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 101: invalid device ordinal
Following suggestions found online, we reinstalled the NVIDIA driver and the CUDA toolkit, but that did not help.
[Screenshot: nvidia-smi output]
What else can we try?
(We have tried CUDA toolkit versions 12.4, 12.2, and 11.6.)
Answer
Since you have multiple GPUs, try specifying the device index explicitly. Indices start at 0.
For example: x.to("cuda:1")
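To find out whether one particular card is the problem, a rough sketch like the one below could try each index in turn. The helper names (device_name, probe_devices, run_probe) are made up for illustration; only torch.cuda.device_count(), torch.zeros() and .to() are real PyTorch calls. Note that if CUDA initialization itself fails, every index may report the same error.

```python
def device_name(index):
    """Build an explicit device string such as "cuda:1" (indices start at 0)."""
    return f"cuda:{index}"


def probe_devices(count, try_device):
    """Call try_device(name) for each index and record whether it raised."""
    results = {}
    for i in range(count):
        name = device_name(i)
        try:
            try_device(name)
            results[name] = "ok"
        except RuntimeError as exc:
            # PyTorch reports CUDA failures as RuntimeError
            results[name] = f"failed: {exc}"
    return results


def run_probe():
    """Hypothetical entry point: move a tiny tensor to every visible GPU."""
    import torch  # imported here so the helpers above work without PyTorch

    def try_device(name):
        torch.zeros(1).to(name)

    for name, status in probe_devices(torch.cuda.device_count(), try_device).items():
        print(name, status)
```

Calling run_probe() on the server prints one line per GPU. If only one index fails, that points at a single card rather than at the driver or toolkit install.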