NVIDIA RTX A6000 on PowerEdge T640 #3
Total Time Spent: 10 minutes 58 seconds
Due Date
eric
10 minutes 58 seconds
No due date set.
Blocks
Depends on
#19 Nextcloud Assistant
DevOps/ansible-role-eom
#8 PowerEdge R720
DevOps/software-infrastructure
Reference: DevOps/software-infrastructure#3
The NVIDIA GPU provisioned to the T640 server is unusable: it fails to load with the error "RmInitAdapter failed!". Investigate the cause and resolve the issue so that the GPU is visible using nvidia-smi.
The "solution" to this may be purchasing a new GPU.
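A minimal sketch of how the two symptoms described above could be checked from a shell. The function names are hypothetical helpers made up for illustration. The sketch assumes the driver's messages end up in the kernel log (readable with dmesg) and that nvidia-smi is on the PATH once a driver is installed.

```shell
#!/bin/sh
# Hypothetical helpers for this issue; the names are illustrative only.

# Print the kernel-log lines (read from stdin) that mention the failure.
# Prints nothing if the error does not appear in the input.
rminit_failures() {
    grep 'RmInitAdapter failed' || true
}

# Succeed only if nvidia-smi exists and is able to query the GPU.
gpu_visible() {
    command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1
}

# Example use on the T640 (not run here):
#   sudo dmesg | rminit_failures
#   gpu_visible && echo "GPU visible" || echo "GPU not visible"
```

Reading the log from stdin keeps the filter independent of where the log comes from, so the same function works on a saved log file.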
This post from the NVIDIA forums suggests using the open drivers. It is not clear from where to run the specified command:
sudo ./NVIDIA-Linux-x86_64-525.116.04.run -m=kernel-open
This is the exact error:
Driver version 550.127.08 was released on Nov. 19, 2024.
Help could potentially be recruited from local businesses. For example: https://www.abettercomputerservice.com/
nvidia-detect was installed, and it recommends the default nvidia-driver package. The 470 drivers were also mentioned.
The 470 driver was installed. The PowerEdge T640 was rebooted. DHCP leases had expired, so alpha-control-plane came back up with a different IP address. This caused the cluster to fail to come back online. The control plane was given a static IP and the system was rebooted once more. The cluster came back online once this was done. Terrifying. Though disaster was avoided, moving the control plane to the PowerEdge R350 was considered. Perhaps there is a cleaner way to do this.
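One way to catch this before a reboot might be to list which addresses on a host are still lease-based. The sketch below is an assumption-laden illustration, not the procedure that was used here: dynamic_addrs is a hypothetical helper, and it relies on iproute2 flagging addresses with a finite lifetime (such as DHCP leases) as "dynamic" in the one-line output of ip.

```shell
#!/bin/sh
# Hypothetical helper: list interfaces whose IPv4 address is lease-based.
# Reads the output of `ip -o -4 addr show` from stdin and prints
# "<interface> <address/prefix>" for every address flagged "dynamic".
dynamic_addrs() {
    awk '/ dynamic /{ print $2, $4 }'
}

# Example use on the control-plane host before a reboot (not run here):
#   ip -o -4 addr show | dynamic_addrs
```

An empty result would suggest that no address on the host depends on a lease, which is the state the control plane ended up in after it was given a static IP.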
The 470 driver did not work.
Attempted updating to the latest version of nvidia-driver from the Debian repos. Still no luck.
I am experimenting with installing the GPU on a PowerEdge R720 with Arch Linux instead of Debian (which was previously able to utilize the device).
It has occurred to me that the Wayland desktop may be causing issues with the GPU. I should try booting to Xorg or sans desktop before more intrusive methods are attempted.
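A small sketch of how the session type could be checked from a terminal. session_type is a hypothetical helper, and the sketch assumes the login manager sets XDG_SESSION_TYPE, as is usual on systemd-based desktops.

```shell
#!/bin/sh
# Hypothetical helper: report the display server of the current login session.
# XDG_SESSION_TYPE is normally set at login to values such as
# "wayland", "x11" or "tty"; "unknown" is printed if it is unset.
session_type() {
    echo "${XDG_SESSION_TYPE:-unknown}"
}

# Example use (not run here):
#   session_type
# To boot without a desktop on a systemd host, the default target can be switched:
#   sudo systemctl set-default multi-user.target   # revert with graphical.target
```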
From the GNOME configuration menu, the T640 appears to be already using X11 windowing. Maybe the 1100W PSU from the R720 would provide sufficient power for the T640?
1100W PSUs have been installed and the RmInitAdapter error persists. I may try the proprietary drivers.
The A6000 has been uninstalled and replaced with a Tesla T4.