gpu-operator is NVIDIA's operator for managing the node GPU driver containers, the container runtime, and DCGM monitoring in a Kubernetes cluster.
network-operator is NVIDIA's operator for managing secondary network IP allocation for pods and deploying the InfiniBand (IB) driver in a Kubernetes cluster.
Environment
| | Version |
|---|---|
| OS | ubuntu 22.04 |
| Kubernetes | 1.31 |
| gpu-operator | v25.3.0 |
| GPU | H200 |
| network-operator | 25.1.0 |
gpu-operator
Download
Pulling the chart is mainly to see what options values.yaml exposes.
Add the helm repo
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update
Check the available gpu-operator versions
$ helm search repo gpu
NAME CHART VERSION APP VERSION DESCRIPTION
nvidia/gpu-operator v25.3.0 v25.3.0 NVIDIA GPU Operator creates/configures/manages ...
nvidia/k8s-nim-operator 1.0.1 1.0.1 NVIDIA NIM Operator creates/configures/manages ...
nvidia/network-operator 25.1.0 v25.1.0 Nvidia network operator
Download gpu-operator
helm pull nvidia/gpu-operator --version=v25.3.0
# extract the operator chart
tar zxf gpu-operator-v25.3.0.tgz
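Since the point of pulling the chart is to inspect the available options, helm show values can also dump the defaults without unpacking the tarball; the grep below is just a quick filter for the sections confirmed next (assuming the chart extracts to a gpu-operator/ directory, which is the default).
# dump all default values of the chart
helm show values nvidia/gpu-operator --version=v25.3.0 | less
# or filter the extracted copy for the sections discussed below
grep -nE '^(nfd|driver|toolkit|operator):' gpu-operator/values.yaml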
Confirm the configuration
Enable NFD (node-feature-discovery) for automatic discovery of hardware devices on the nodes
nfd.enabled=true
The RDMA driver itself is deployed through the network-operator; see the network-operator section below
driver.rdma.enabled=true
# no MOFED driver pre-installed on the host
driver.rdma.useHostMofed=false
Pick image tags from NVIDIA's catalog that match Ubuntu 22.04
# CUDA image tags: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda/tags
operator.initContainer.version=12.8.1-base-ubuntu22.04
# this image currently has no 22.04 tag: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/k8s/containers/container-toolkit/tags
toolkit.version=v1.17.5-ubuntu20.04
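Before settling on these values it helps to double-check the node OS image (for the Ubuntu 22.04 tags) and whether a MOFED stack is already present on the host, which is what driver.rdma.useHostMofed should reflect. A quick sketch, with <node> as a placeholder hostname:
# OS image, kernel and container runtime of each node
kubectl get nodes -o wide
# check for an existing host MOFED installation (<node> is a placeholder)
ssh <node> 'lsmod | grep -i mlx5 ; command -v ofed_info && ofed_info -s'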
Install
Pulling NVIDIA images from within China tends to hit network problems; you will have to work around that yourself.
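One way to scope the problem is to render the chart locally and list the images it references, then mirror them to a registry you can reach; note that this only catches templated images, not ones the operator pulls later at runtime.
helm template gpu-operator nvidia/gpu-operator \
--version=v25.3.0 -n gpu-operator \
--set nfd.enabled=true \
| grep 'image:' | sort -u
With the images reachable, run the install: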
helm install --wait gpu-operator \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--version=v25.3.0 \
--set nfd.enabled=true \
--set driver.rdma.enabled=true \
--set driver.rdma.useHostMofed=false \
--set operator.initContainer.version=12.8.1-base-ubuntu22.04 \
--set toolkit.version=v1.17.5-ubuntu20.04
The pods after a successful install:
kubectl get po -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-m7nn7 1/1 Running 0 102m
gpu-operator-6d855bcb58-hrvgf 1/1 Running 2 (32m ago) 102m
gpu-operator-node-feature-discovery-gc-78d798587d-ptqxs 1/1 Running 0 102m
gpu-operator-node-feature-discovery-master-96db5444c-zhtnc 1/1 Running 1 (32m ago) 102m
gpu-operator-node-feature-discovery-worker-lltvr 1/1 Running 2 (32m ago) 102m
nvidia-container-toolkit-daemonset-hkbrh 1/1 Running 0 102m
nvidia-cuda-validator-v45fc 0/1 Completed 0 96m
nvidia-dcgm-exporter-h5zvj 1/1 Running 0 102m
nvidia-device-plugin-daemonset-sml9x 1/1 Running 0 102m
nvidia-driver-daemonset-26d7p 1/1 Running 0 102m
nvidia-mig-manager-7x6rd 1/1 Running 0 95m
nvidia-operator-validator-jqvfs 1/1 Running 0 102m
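To sanity-check the whole stack, a throwaway pod that requests one GPU and runs nvidia-smi is enough; a minimal sketch, with an arbitrary pod name and the same CUDA base tag referenced above (nvidia-smi is mounted into the container by the NVIDIA container toolkit).
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.8.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
# once the pod completes, the nvidia-smi output confirms driver, toolkit and device plugin
kubectl logs cuda-smoke-test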
network-operator
Download
helm pull nvidia/network-operator --version=25.1.0
# extract the operator chart
tar zxf network-operator-25.1.0.tgz
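As with gpu-operator, the pull is mainly to inspect values.yaml; the grep below assumes the chart keeps nfd as a top-level key and extracts to a network-operator/ directory.
grep -n -A2 '^nfd:' network-operator/values.yaml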
Confirm the configuration
The only thing to confirm is that nfd is left at false, which is already the default: NFD should be deployed by only one of gpu-operator and network-operator, and gpu-operator already deploys it here.
nfd.enabled=false
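Because gpu-operator's NFD instance is doing the labelling, it is worth checking that the Mellanox PCI vendor label is already on the nodes before installing, since network-operator's daemonsets typically schedule on it. The exact label can vary with NFD's deviceLabelFields setting; with vendor-only labels it looks like this (15b3 is the Mellanox PCI vendor ID):
kubectl get nodes -l feature.node.kubernetes.io/pci-15b3.present=true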
Install
helm install network-operator nvidia/network-operator \
-n nvidia-network-operator \
--create-namespace \
--version v25.1.0 \
--wait
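A quick check that the operator is running and has registered its CRDs before creating any custom resources:
kubectl -n nvidia-network-operator get pods
kubectl get crd | grep mellanox.com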
Create the NicClusterPolicy
$ cat NicClusterPolicy.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.01-0.6.0.0-0
    forcePrecompiled: false
    imagePullSecrets: []
    terminationGracePeriodSeconds: 300
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
    upgradePolicy:
      autoUpgrade: true
      maxParallelUpgrades: 1
      safeLoad: false
      drain:
        enable: true
        force: true
        podSelector: ""
        timeoutSeconds: 300
        deleteEmptyDir: true
  rdmaSharedDevicePlugin:
    # [map[ifNames:[ens1f0] name:rdma_shared_device_a]]
    image: k8s-rdma-shared-dev-plugin
    repository: ghcr.io/mellanox
    version: v1.5.2
    imagePullSecrets: []
    # The config below directly propagates to k8s-rdma-shared-device-plugin configuration.
    # Replace 'devices' with your (RDMA capable) netdevice name.
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_a",
            "rdmaHcaMax": 63,
            "selectors": {
              "vendors": [],
              "deviceIDs": [],
              "drivers": [],
              "ifNames": ["ibp26s0","ibp60s0"],
              "linkTypes": []
            }
          },
          {
            "resourceName": "rdma_shared_device_b",
            "rdmaHcaMax": 63,
            "selectors": {
              "vendors": [],
              "deviceIDs": [],
              "drivers": [],
              "ifNames": ["ibp77s0","ibp94s0"],
              "linkTypes": []
            }
          }
        ]
      }
The important part is the config section:
resourceName: the resource name requested when assigning the resource to a pod; with the default rdma/ prefix it becomes rdma/rdma_shared_device_a.
rdmaHcaMax: how many of this resource the kubelet may allocate.
vendors: the 15b3 in the [15b3:1021] pair shown by lspci -nn | grep -i mel.
deviceIDs: the 1021 in the same [15b3:1021] pair.
ifNames: the device names shown by ifconfig or ip a on the host OS.
Either list ifNames, or list vendors and deviceIDs, to determine which NICs belong to a given resourceName; one way to collect these values is shown below.
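A sketch for gathering the selector values (ibdev2netdev is a MOFED utility, so run it wherever the MOFED userspace is available, e.g. inside the mofed/doca-driver pod):
# vendor and device IDs: the [15b3:1021] pairs
lspci -nn | grep -i mellanox
# netdevice names on the host, and their mapping to mlx5_X devices
ip -br addr
ibdev2netdev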
kubectl create -f NicClusterPolicy.yaml
The running pods after the policy is created:
$ kubectl get po -n nvidia-network-operator
NAME READY STATUS RESTARTS AGE
mofed-ubuntu22.04-898d5f6-ds-m82rj 1/1 Running 0 132m
network-operator-84c4dc4746-7f7f8 1/1 Running 2 (62m ago) 125m
rdma-shared-dp-ds-2t2zm 1/1 Running 0 87m
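Two quick checks that the policy converged: the CR's state (assuming this release exposes status.state) and the shared RDMA resources showing up as allocatable on the node (<node-name> is a placeholder).
kubectl get nicclusterpolicy nic-cluster-policy -o jsonpath='{.status.state}{"\n"}'
kubectl describe node <node-name> | grep -i 'rdma/'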
Cross-pod IB test
pod1
apiVersion: v1
kind: Pod
metadata:
  name: mofed-test-pod1
spec:
  restartPolicy: OnFailure
  containers:
  #- image: mellanox/rping-test
  - image: dtnaas/ofed:5.4-3
    name: mofed-test-ctr
    securityContext:
      capabilities:
        add: [ "IPC_LOCK", "SYS_RESOURCE" ]
    resources:
      limits:
        rdma/rdma_shared_device_a: 1
    command:
    - sh
    - -c
    - |
      ls -l /dev/infiniband /sys/class/infiniband /sys/class/net
      sleep 1000000
pod2
apiVersion: v1
kind: Pod
metadata:
  name: mofed-test-pod2
spec:
  restartPolicy: OnFailure
  containers:
  #- image: mellanox/rping-test
  - image: dtnaas/ofed:5.4-3
    name: mofed-test-ctr
    securityContext:
      capabilities:
        add: [ "IPC_LOCK", "SYS_RESOURCE" ]
    resources:
      limits:
        rdma/rdma_shared_device_b: 1 # a different resource from pod1
    command:
    - sh
    - -c
    - |
      ls -l /dev/infiniband /sys/class/infiniband /sys/class/net
      sleep 1000000
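Assuming the two specs above were saved as pod1.yaml and pod2.yaml (the file names are arbitrary), create them and make sure both are scheduled:
kubectl apply -f pod1.yaml -f pod2.yaml
kubectl get po mofed-test-pod1 mofed-test-pod2 -o wide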
Test
Start the server on pod1
$ kubectl exec -it mofed-test-pod1 -- bash
$ ibv_devices
device node GUID
------ ----------------
mlx5_0 5c257303000c06c8
mlx5_1 5c257303000c0248
# start the server on mlx5_0
$ ib_write_bw -d mlx5_0 -a -F --report_gbits
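The client needs the server pod's IP (10.233.123.82 in the run below); it can be read from the pod status:
kubectl get pod mofed-test-pod1 -o jsonpath='{.status.podIP}{"\n"}'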
Start the client on pod2
$ kubectl exec -it mofed-test-pod2 -- bash
$ ibv_devices
device node GUID
------ ----------------
mlx5_2 5c2573030006ca7a
mlx5_3 5c257303000c07fc
# start the client on mlx5_2 and connect to the server; the IP is the server pod's IP
$ ib_write_bw -F -d mlx5_2 10.233.123.82 -D 10 --cpu_util --report_gbits
The results are below; both ends report an average bandwidth of 362.56 Gb/sec.
# pod1
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
CQ Moderation : 100
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x1a QPN 0x0048 PSN 0x9a1f1a RKey 0x1fff00 VAddr 0x007ff907a9f000
remote address: LID 0x33 QPN 0x0048 PSN 0xedadbc RKey 0x1fff00 VAddr 0x007f51c2a9e000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
65536 4149167 0.00 362.56 0.691527
---------------------------------------------------------------------------------------
# pod2
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_2
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x33 QPN 0x0048 PSN 0xedadbc RKey 0x1fff00 VAddr 0x007f51c2a9e000
remote address: LID 0x1a QPN 0x0048 PSN 0x9a1f1a RKey 0x1fff00 VAddr 0x007ff907a9f000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] CPU_Util[%]
65536 4149167 0.00 362.56 0.691527 0.74
---------------------------------------------------------------------------------------
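Beyond bandwidth, the same perftest suite can also measure latency between the two pods; a sketch reusing the same devices and server IP as above (run the first command in pod1, the second in pod2):
# server side (pod1)
ib_write_lat -d mlx5_0 -F
# client side (pod2)
ib_write_lat -d mlx5_2 -F 10.233.123.82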