NVIDIA RTX A6000 on PowerEdge T640 #3

Open
opened 2024-12-12 02:02:58 +00:00 by eric · 14 comments
Owner

The NVIDIA GPU provisioned to the T640 server is unusable - failing to load with the error `RmInitAdapter failed!`. Investigate the cause and resolve the issue so that the GPU is visible using `nvidia-smi`
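
The acceptance criterion above (the GPU being visible to `nvidia-smi`) could be checked from a script. The following is a minimal sketch, assuming `nvidia-smi -L` prints one line per device in the form `GPU <index>: <name> (UUID: <uuid>)`. The helper names `parse_gpu_list` and `visible_gpus` are hypothetical and not part of any existing tooling on the T640.

```python
import re
import subprocess

# One line of `nvidia-smi -L` output is assumed to look like:
#   GPU 0: NVIDIA RTX A6000 (UUID: GPU-xxxxxxxx)
GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: (\S+)\)\s*$")


def parse_gpu_list(text):
    """Return (index, name, uuid) tuples parsed from `nvidia-smi -L` output."""
    gpus = []
    for line in text.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            gpus.append((int(match.group(1)), match.group(2), match.group(3)))
    return gpus


def visible_gpus():
    """Run `nvidia-smi -L`; return [] if the tool is missing or exits non-zero."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        return []
    if result.returncode != 0:
        return []
    return parse_gpu_list(result.stdout)


if __name__ == "__main__":
    found = visible_gpus()
    if found:
        for index, name, uuid in found:
            print(f"GPU {index}: {name} ({uuid})")
    else:
        print("No GPU visible to nvidia-smi")
```

An empty result only says that no device was listed; the kernel log is still where the `RmInitAdapter` failure itself shows up.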
eric self-assigned this 2024-12-12 02:02:58 +00:00
Author
Owner

The "solution" to this may be purchasing a new GPU.
Author
Owner

[This post from the NVIDIA forums](https://forums.developer.nvidia.com/t/solved-rminitadapter-failed-to-load-530-41-03-or-any-nvidia-modules-other-than-450-236-01-linux-via-esxi-7-0u3-passthrough-pci-gtx-1650/253239) suggests using the open drivers. It is unclear where the specified command, `sudo ./NVIDIA-Linux-x86_64-525.116.04.run -m=kernel-open`, should be run from.
eric started working 2024-12-18 01:43:59 +00:00
Author
Owner

This is the exact error:

```
[3189488.647670] NVRM: GPU 0000:b3:00.0: RmInitAdapter failed! (0x24:0x72:1435)
[3189488.647722] NVRM: GPU 0000:b3:00.0: rm_init_adapter failed, device minor number 0
[3189488.911998] NVRM: GPU 0000:b3:00.0: RmInitAdapter failed! (0x24:0x72:1435)
[3189488.912054] NVRM: GPU 0000:b3:00.0: rm_init_adapter failed, device minor number 0
```
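
To confirm whether the same failure recurs after each driver or hardware change, the kernel log lines above can be matched programmatically. This is a rough sketch, assuming the log text has already been captured (for example from `dmesg`). The function name `find_rminit_failures` is hypothetical, and the parenthesised code is kept as an opaque string without interpreting it.

```python
import re

# Matches lines such as:
#   [3189488.647670] NVRM: GPU 0000:b3:00.0: RmInitAdapter failed! (0x24:0x72:1435)
NVRM_FAILURE = re.compile(
    r"^\[\s*(?P<timestamp>\d+\.\d+)\]\s+NVRM: GPU (?P<pci>[0-9a-fA-F:.]+): "
    r"RmInitAdapter failed! \((?P<code>[^)]*)\)"
)

# Sample copied from the error reported in this issue.
SAMPLE_LOG = """\
[3189488.647670] NVRM: GPU 0000:b3:00.0: RmInitAdapter failed! (0x24:0x72:1435)
[3189488.647722] NVRM: GPU 0000:b3:00.0: rm_init_adapter failed, device minor number 0
[3189488.911998] NVRM: GPU 0000:b3:00.0: RmInitAdapter failed! (0x24:0x72:1435)
[3189488.912054] NVRM: GPU 0000:b3:00.0: rm_init_adapter failed, device minor number 0
"""


def find_rminit_failures(log_text):
    """Return (timestamp, pci_address, code) for each RmInitAdapter failure line."""
    failures = []
    for line in log_text.splitlines():
        match = NVRM_FAILURE.match(line.strip())
        if match:
            failures.append(
                (
                    float(match.group("timestamp")),
                    match.group("pci"),
                    match.group("code"),
                )
            )
    return failures


if __name__ == "__main__":
    for timestamp, pci, code in find_rminit_failures(SAMPLE_LOG):
        print(f"{timestamp:.6f} {pci} {code}")
```

Comparing the PCI address and code across attempts would show whether a change (driver version, PSU, host) altered the failure or left it identical.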
Author
Owner

Driver version `550.127.08` was [released](https://www.nvidia.com/en-us/drivers/details/236256/) on Nov. 19, 2024.
Author
Owner

Help could potentially be recruited from local businesses. For example: https://www.abettercomputerservice.com/
eric stopped working 2024-12-18 01:54:57 +00:00
10 minutes 58 seconds
Author
Owner

`nvidia-detect` was installed, and it recommends the default `nvidia-driver` package. The 470 drivers were also mentioned.
Author
Owner

The 470 driver was installed. The PowerEdge T640 was rebooted. DHCP leases had expired, so alpha-control-plane came back up with a different IP address. This caused the cluster to fail to come back online. The control plane was given a static IP and the system was rebooted once more. The cluster came back online once this was done. Terrifying. Though disaster was avoided, moving the control plane to the PowerEdge R350 was considered. Perhaps there is a cleaner way to do this.

The 470 driver did not work.

Author
Owner

Attempted updating to the latest version of `nvidia-driver` from the Debian repos. Still no luck.
eric added this to the Network Infrastructure project 2025-01-27 21:55:00 +00:00
eric added a new dependency 2025-01-29 18:04:39 +00:00
Author
Owner

I am experimenting with installing the GPU on a PowerEdge R720 with Arch Linux instead of Debian (which was previously able to utilize the device).
eric added a new dependency 2025-02-25 14:10:09 +00:00
Author
Owner

It has occurred to me that the Wayland desktop may be causing issues with the GPU. I should try booting to Xorg or sans desktop before more intrusive methods are attempted.
Author
Owner

From the GNOME configuration menu, the T640 appears to be already using X11 windowing. Maybe the 1100W PSU from the R720 would provide sufficient power for the T640?
Author
Owner

1100W PSUs have been installed and the `RmInitAdapter` error persists. I may try the proprietary drivers.
eric added this to the (deleted) milestone 2025-03-02 19:43:23 +00:00
eric added this to the (deleted) milestone 2025-03-02 19:57:17 +00:00
Author
Owner

The A6000 has been uninstalled and replaced with a Tesla T4.
Author
Owner

The A6000 has been uninstalled and replaced with a Tesla T4.
eric closed this issue 2025-03-07 16:00:42 +00:00
eric reopened this issue 2025-03-15 20:04:31 +00:00
Blocks
#19 Nextcloud Assistant
DevOps/ansible-role-eom
Depends on
#8 PowerEdge R720
DevOps/software-infrastructure
Reference: DevOps/software-infrastructure#3