Cannot solve Error 101: invalid device ordinal


We have one server with five older GPUs (5 × NVIDIA Titan Xp).

Running x.to("cuda") raises the following error, even though torch.cuda.device_count() correctly returns 5.

RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 101: invalid device ordinal

Following suggestions found online, we reinstalled the NVIDIA driver and the CUDA toolkit, but that did not help.

[Screenshot: nvidia-smi output]

What else can we try?

(We have tried CUDA toolkit versions 12.4, 12.2, and 11.6.)

Answer

If you have multiple GPUs, try specifying the device index explicitly. Indices start at 0.

For example, try something like x.to("cuda:1").
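To check which indices work, you could probe each one in turn. This is only a rough sketch: candidate_devices and probe_devices are helper names made up for this example, not part of PyTorch, and it assumes torch.cuda.device_count() returns the right number, as described in the question. It may not fix the underlying problem, but it should show whether one particular GPU is the one failing.

```python
from typing import Callable, Dict, List


def candidate_devices(count: int) -> List[str]:
    """Build explicit device strings "cuda:0" .. "cuda:<count-1>" (hypothetical helper)."""
    return [f"cuda:{i}" for i in range(count)]


def probe_devices(devices: List[str], try_device: Callable[[str], None]) -> Dict[str, str]:
    """Call try_device on each device string and record "ok" or the error text."""
    results: Dict[str, str] = {}
    for dev in devices:
        try:
            try_device(dev)
            results[dev] = "ok"
        except Exception as exc:  # a RuntimeError in the case from the question
            results[dev] = f"failed: {exc}"
    return results


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is None:
        print("PyTorch is not installed; nothing to probe.")
    else:
        def move_small_tensor(dev: str) -> None:
            # Moving a tiny tensor forces CUDA to initialise on that device.
            torch.zeros(1).to(dev)

        # Assumption: device_count() reports the real number of GPUs.
        count = torch.cuda.device_count()
        for dev, status in probe_devices(candidate_devices(count), move_small_tensor).items():
            print(dev, status)
```

If only some indices fail, that at least narrows the problem down to specific cards rather than the whole driver or toolkit installation.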
