NVIDIA Tesla T4 #10

Closed
opened 2025-03-07 16:03:53 +00:00 by eric · 7 comments
Owner

A Tesla T4 GPU has been installed on the PowerEdge T640 that needs to be integrated into the Kubernetes cluster. Configure it as a passthrough device for alpha-worker-0 and make it available within Kubernetes.

eric self-assigned this 2025-03-07 16:04:04 +00:00
eric added a new dependency 2025-03-07 16:04:37 +00:00
Author
Owner

The following needs to be done:

  1. Uninstall NVIDIA drivers from the T640 host
  2. Update GRUB
  3. Enable VFIO drivers
  4. Set up device passthrough using virt-manager
  5. Install NVIDIA drivers on alpha-worker-0
  6. Deploy the NVIDIA Container Toolkit
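
Steps 2 and 3 usually come down to adding a few kernel parameters to the GRUB command line on the host. The following is a minimal sketch of how those parameters might be assembled, not a record of what was actually run. It assumes an Intel host (hence `intel_iommu=on`), and the PCI ID `10de:1eb8` is assumed to be the Tesla T4's vendor:device ID; confirm it with `lspci -nn` before use. The helper names `vfio_kernel_args` and `grub_cmdline` are hypothetical.

```python
# Sketch: build the kernel command-line additions for VFIO passthrough.
# ASSUMPTIONS: Intel host (intel_iommu=on); the PCI ID used in the example
# must be confirmed with `lspci -nn` on the host.


def vfio_kernel_args(pci_ids):
    """Return kernel parameters that enable the IOMMU and bind the given
    vendor:device IDs to vfio-pci at boot."""
    if not pci_ids:
        raise ValueError("at least one vendor:device ID is required")
    return ["intel_iommu=on", "iommu=pt", "vfio-pci.ids=" + ",".join(pci_ids)]


def grub_cmdline(existing, extra):
    """Append parameters to an existing GRUB_CMDLINE_LINUX value,
    skipping any that are already present."""
    parts = existing.split()
    for arg in extra:
        if arg not in parts:
            parts.append(arg)
    return 'GRUB_CMDLINE_LINUX="' + " ".join(parts) + '"'


if __name__ == "__main__":
    # "10de:1eb8" is assumed to be the Tesla T4; verify before use.
    print(grub_cmdline("quiet", vfio_kernel_args(["10de:1eb8"])))
```

The resulting line would go into `/etc/default/grub`, after which the GRUB configuration has to be regenerated and the host rebooted for the parameters to take effect.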
eric started working 2025-03-08 15:02:36 +00:00
Author
Owner

GPU passthrough to alpha-worker-0 was surprisingly smooth. Steps 1-5 are now complete. A separate issue should be made for automating the steps taken during this installation and for reconnecting the GPU to the guest VM on reboot.

Author
Owner
Created DevOps/software-infrastructure#11
Author
Owner
And DevOps/software-infrastructure#12
Author
Owner


[Documentation](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/) for scheduling GPUs in Kubernetes.
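
As described in that documentation, a container requests a GPU through the `nvidia.com/gpu` extended resource, which is specified under `limits`. A minimal sketch of a Pod manifest that does this, generated as JSON so it can be piped to `kubectl`, might look like the following. The pod name, the image name, and the `gpu_pod_manifest` helper are placeholders for illustration, not values taken from this cluster.

```python
import json


def gpu_pod_manifest(name, image, gpus=1):
    """Build a Pod manifest whose single container requests `gpus`
    NVIDIA GPUs via the nvidia.com/gpu extended resource."""
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "OnFailure",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # GPUs are requested in the limits section.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Placeholder name and image; substitute a real CUDA-capable image.
    manifest = gpu_pod_manifest("gpu-smoke-test", "registry.example/cuda-test:latest")
    print(json.dumps(manifest, indent=2))
```

Assuming the script is saved as `gpu_pod.py`, the output could be applied with `python gpu_pod.py | kubectl apply -f -`, since `kubectl` accepts JSON manifests as well as YAML.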
Author
Owner


This [installation guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) provides more detailed information.
eric stopped working 2025-03-08 16:57:07 +00:00
1 hour 54 minutes
eric added a new dependency 2025-03-08 16:58:04 +00:00
eric started working 2025-03-08 17:23:43 +00:00
eric stopped working 2025-03-08 19:49:14 +00:00
2 hours 25 minutes
Author
Owner


The T4 appears to be available within Kubernetes; however, the underlying VM is running low on storage, preventing containers from being deployed.
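
Both halves of that observation can be checked from the node object. The sketch below reads the allocatable `nvidia.com/gpu` count and the `DiskPressure` condition from the JSON that `kubectl get node <name> -o json` returns. It assumes the low-storage problem surfaces as a `DiskPressure` condition, which was not confirmed in this issue; the `gpu_and_disk_status` helper and the sample data are hypothetical.

```python
def gpu_and_disk_status(node):
    """Return (allocatable_gpus, disk_pressure) for a node object as
    returned by `kubectl get node <name> -o json`."""
    status = node.get("status", {})
    # Allocatable quantities are reported as strings, e.g. "1".
    gpus = int(status.get("allocatable", {}).get("nvidia.com/gpu", "0"))
    disk_pressure = any(
        cond.get("type") == "DiskPressure" and cond.get("status") == "True"
        for cond in status.get("conditions", [])
    )
    return gpus, disk_pressure


if __name__ == "__main__":
    # Hand-written sample data, not output captured from the cluster.
    sample = {
        "status": {
            "allocatable": {"cpu": "8", "nvidia.com/gpu": "1"},
            "conditions": [
                {"type": "DiskPressure", "status": "True"},
                {"type": "Ready", "status": "True"},
            ],
        }
    }
    gpus, pressure = gpu_and_disk_status(sample)
    print(f"allocatable GPUs: {gpus}, disk pressure: {pressure}")
```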
eric closed this issue 2025-03-08 19:50:01 +00:00
Total Time Spent: 4 hours 20 minutes
eric
4 hours 20 minutes
Blocks:
DevOps/ansible-role-eom#19 Nextcloud Assistant
DevOps/ansible-role-eom#26 Deploy LocalAI
Reference: DevOps/software-infrastructure#10