gpu-operator is the operator NVIDIA provides to manage the GPU driver containers, the container runtime, and DCGM monitoring on the nodes of a Kubernetes cluster.

network-operator is the operator NVIDIA provides to manage secondary-network IP allocation for pods and to deploy the InfiniBand (IB) driver in a Kubernetes cluster.

Environment

OS: Ubuntu 22.04
Kubernetes: 1.31
gpu-operator: v25.3.0
GPU: H200
network-operator: 25.1.0

gpu-operator

References

Download

Downloading the chart is really just to see what settings are available in values.yaml.

Add the helm repo

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
    && helm repo update

Check the gpu-operator version

$ helm search repo gpu
NAME                       CHART VERSION    APP VERSION    DESCRIPTION                                       
nvidia/gpu-operator        v25.3.0          v25.3.0        NVIDIA GPU Operator creates/configures/manages ...
nvidia/k8s-nim-operator    1.0.1            1.0.1          NVIDIA NIM Operator creates/configures/manages ...
nvidia/network-operator    25.1.0           v25.1.0        Nvidia network operator

Download gpu-operator

helm pull nvidia/gpu-operator --version=v25.3.0 
# extract the operator chart
tar zxf gpu-operator-v25.3.0.tgz
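If the goal is only to read the default values, the chart does not even need to be extracted; helm can print them directly (the output file name here is arbitrary):

helm show values nvidia/gpu-operator --version=v25.3.0 > gpu-operator-default-values.yaml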

Confirm the configuration

Enable NFD (Node Feature Discovery), which automatically discovers hardware devices on the nodes.

nfd.enabled=true
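Once the operator is installed (see the Install step below), a quick way to confirm that NFD discovered the hardware is to look at the node labels; a minimal sketch, assuming a placeholder node name (the exact label keys depend on your hardware):

# NFD / GPU feature labels on the node
kubectl get node <node-name> --show-labels | tr ',' '\n' | grep -E 'feature.node.kubernetes.io|nvidia.com'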

The RDMA driver will be deployed by network-operator instead; see the reference.

driver.rdma.enabled=true
# no driver installed on the host
driver.rdma.useHostMofed=false

Find image versions suitable for Ubuntu 22.04 in NVIDIA's catalog.

# CUDA image tags: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda/tags
operator.initContainer.version=12.8.1-base-ubuntu22.04
# this image currently has no 22.04 tag: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/k8s/containers/container-toolkit/tags
toolkit.version=v1.17.5-ubuntu20.04
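If you prefer a values file over a string of --set flags, the settings confirmed above can be collected into a sketch like the following (the file name is arbitrary) and passed to helm install with -f gpu-operator-values.yaml:

# gpu-operator-values.yaml: the same keys as the --set flags used below
nfd:
  enabled: true
driver:
  rdma:
    enabled: true
    useHostMofed: false
operator:
  initContainer:
    version: 12.8.1-base-ubuntu22.04
toolkit:
  version: v1.17.5-ubuntu20.04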

Install

Pulling NVIDIA images from mainland China runs into network problems; you'll have to solve that on your own~

helm install --wait gpu-operator \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --version=v25.3.0 \
    --set nfd.enabled=true \
    --set driver.rdma.enabled=true \
    --set driver.rdma.useHostMofed=false \
    --set operator.initContainer.version=12.8.1-base-ubuntu22.04 \
    --set toolkit.version=v1.17.5-ubuntu20.04

The pods after a successful installation are as follows:

kubectl get po -n gpu-operator
NAME                                                         READY   STATUS      RESTARTS      AGE
gpu-feature-discovery-m7nn7                                  1/1     Running     0             102m
gpu-operator-6d855bcb58-hrvgf                                1/1     Running     2 (32m ago)   102m
gpu-operator-node-feature-discovery-gc-78d798587d-ptqxs      1/1     Running     0             102m
gpu-operator-node-feature-discovery-master-96db5444c-zhtnc   1/1     Running     1 (32m ago)   102m
gpu-operator-node-feature-discovery-worker-lltvr             1/1     Running     2 (32m ago)   102m
nvidia-container-toolkit-daemonset-hkbrh                     1/1     Running     0             102m
nvidia-cuda-validator-v45fc                                  0/1     Completed   0             96m
nvidia-dcgm-exporter-h5zvj                                   1/1     Running     0             102m
nvidia-device-plugin-daemonset-sml9x                         1/1     Running     0             102m
nvidia-driver-daemonset-26d7p                                1/1     Running     0             102m
nvidia-mig-manager-7x6rd                                     1/1     Running     0             95m
nvidia-operator-validator-jqvfs                              1/1     Running     0             102m
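To double-check that the node actually exposes GPU resources, two quick checks (the node name is a placeholder; the driver pod name is taken from the listing above):

# number of GPUs advertised by the device plugin
kubectl get node <node-name> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
# run nvidia-smi through the driver daemonset pod
kubectl exec -n gpu-operator nvidia-driver-daemonset-26d7p -- nvidia-smi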

network-operator

References

Download

helm pull nvidia/network-operator --version=25.1.0 
# extract the operator chart
tar zxf network-operator-25.1.0.tgz

Confirm the configuration

The only thing to confirm is that nfd.enabled is false, which is the default; only one of gpu-operator and network-operator should deploy NFD.

nfd.enabled=false

Install

helm install network-operator nvidia/network-operator \
   -n nvidia-network-operator \
   --create-namespace \
   --version v25.1.0 \
   --wait
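Before creating the NicClusterPolicy below, it is worth confirming that the operator pod is running and its CRDs are registered:

kubectl get pods -n nvidia-network-operator
# the mellanox.com CRDs (including nicclusterpolicies) should show up here
kubectl get crd | grep mellanox.com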

Create the NicClusterPolicy

$ cat NicClusterPolicy.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.01-0.6.0.0-0
    forcePrecompiled: false
    imagePullSecrets: []
    terminationGracePeriodSeconds: 300
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
    upgradePolicy:
      autoUpgrade: true
      maxParallelUpgrades: 1
      safeLoad: false
      drain:
        enable: true
        force: true
        podSelector: ""
        timeoutSeconds: 300
        deleteEmptyDir: true
  rdmaSharedDevicePlugin:
    # [map[ifNames:[ens1f0] name:rdma_shared_device_a]]
    image: k8s-rdma-shared-dev-plugin
    repository: ghcr.io/mellanox
    version: v1.5.2
    imagePullSecrets: []
    # The config below directly propagates to k8s-rdma-shared-device-plugin configuration.
    # Replace 'devices' with your (RDMA capable) netdevice name.
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_a",
            "rdmaHcaMax": 63,
            "selectors": {
              "vendors": [],
              "deviceIDs": [],
              "drivers": [],
              "ifNames": ["ibp26s0","ibp60s0"],
              "linkTypes": []
            }
          },
          {
            "resourceName": "rdma_shared_device_b",
            "rdmaHcaMax": 63,
            "selectors": {
              "vendors": [],
              "deviceIDs": [],
              "drivers": [],
              "ifNames": ["ibp77s0","ibp94s0"],
              "linkTypes": []
            }
          }
        ]
      }

The key part is the config section.

resourceName: the resource name used when assigning the resource to a pod; by default it is requested as rdma/rdma_shared_device_a.

rdmaHcaMax: the number of allocatable resources reported to the kubelet.

vendors: the 15b3 part of the [15b3:1021] shown by lspci -nn | grep -i mel.

deviceIDs: the 1021 part of the [15b3:1021] shown by lspci -nn | grep -i mel.

ifNames: the device names shown by ifconfig or ip a on the host OS.

In the selectors, specify either ifNames or vendors/deviceIDs to determine which NICs belong to a given resourceName.
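For example, a sketch of rdma_shared_device_a selected by vendors/deviceIDs instead of ifNames, using the [15b3:1021] values mentioned above:

{
  "resourceName": "rdma_shared_device_a",
  "rdmaHcaMax": 63,
  "selectors": {
    "vendors": ["15b3"],
    "deviceIDs": ["1021"],
    "drivers": [],
    "ifNames": [],
    "linkTypes": []
  }
}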

kubectl create -f NicClusterPolicy.yaml

The pods running after creation are as follows:

$ kubectl get po -n nvidia-network-operator
NAME                                 READY   STATUS    RESTARTS      AGE
mofed-ubuntu22.04-898d5f6-ds-m82rj   1/1     Running   0             132m
network-operator-84c4dc4746-7f7f8    1/1     Running   2 (62m ago)   125m
rdma-shared-dp-ds-2t2zm              1/1     Running   0             87m
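At this point the policy should report ready and the node should advertise the shared RDMA resources; a quick check, assuming .status.state is populated and using a placeholder node name:

# expect "ready"
kubectl get nicclusterpolicy nic-cluster-policy -o jsonpath='{.status.state}'
# expect rdma/rdma_shared_device_a: 63 and rdma/rdma_shared_device_b: 63
kubectl get node <node-name> -o jsonpath='{.status.allocatable}' | tr ',' '\n' | grep rdma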

Cross-pod IB test

pod1

apiVersion: v1
kind: Pod
metadata:
  name: mofed-test-pod1
spec:
  restartPolicy: OnFailure
  containers:
    #- image: mellanox/rping-test
  - image: dtnaas/ofed:5.4-3
    name: mofed-test-ctr
    securityContext:
      capabilities:
        add: [ "IPC_LOCK", "SYS_RESOURCE" ]
    resources:
      limits:
        rdma/rdma_shared_device_a: 1
    command:
    - sh
    - -c
    - |
      ls -l /dev/infiniband /sys/class/infiniband /sys/class/net
      sleep 1000000

pod2

apiVersion: v1
kind: Pod
metadata:
  name: mofed-test-pod2
spec:
  restartPolicy: OnFailure
  containers:
    #- image: mellanox/rping-test
  - image: dtnaas/ofed:5.4-3
    name: mofed-test-ctr
    securityContext:
      capabilities:
        add: [ "IPC_LOCK", "SYS_RESOURCE" ]
    resources:
      limits:
        rdma/rdma_shared_device_b: 1   # note: a different resource from the one pod1 uses
    command:
    - sh
    - -c
    - |
      ls -l /dev/infiniband /sys/class/infiniband /sys/class/net
      sleep 1000000
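Assuming the two manifests above are saved as mofed-test-pod1.yaml and mofed-test-pod2.yaml (the file names are arbitrary), create the pods and wait until they are Ready:

kubectl apply -f mofed-test-pod1.yaml -f mofed-test-pod2.yaml
kubectl wait --for=condition=Ready --timeout=5m pod/mofed-test-pod1 pod/mofed-test-pod2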

Test

Start the server side in pod1

$ kubectl exec -it mofed-test-pod1 -- bash
$ ibv_devices 
    device                 node GUID
    ------              ----------------
    mlx5_0              5c257303000c06c8
    mlx5_1              5c257303000c0248
# start the server side on mlx5_0
$ ib_write_bw -d mlx5_0 -a -F --report_gbits

Start the client side in pod2

$ kubectl exec -it mofed-test-pod2 -- bash
$ ibv_devices 
    device                 node GUID
    ------              ----------------
    mlx5_2              5c2573030006ca7a
    mlx5_3              5c257303000c07fc
# start the client side on mlx5_2 and connect to the server; the IP here is the server pod's IP
$ ib_write_bw  -F -d mlx5_2 10.233.123.82  -D 10 --cpu_util --report_gbits

The results are as follows; both ends report an average bandwidth of 362.56 Gb/sec:

# pod1
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF        Device         : mlx5_0
 Number of qps   : 1        Transport type : IB
 Connection type : RC        Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x1a QPN 0x0048 PSN 0x9a1f1a RKey 0x1fff00 VAddr 0x007ff907a9f000
 remote address: LID 0x33 QPN 0x0048 PSN 0xedadbc RKey 0x1fff00 VAddr 0x007f51c2a9e000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 65536      4149167          0.00               362.56            0.691527
---------------------------------------------------------------------------------------


# pod2
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF        Device         : mlx5_2
 Number of qps   : 1        Transport type : IB
 Connection type : RC        Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x33 QPN 0x0048 PSN 0xedadbc RKey 0x1fff00 VAddr 0x007f51c2a9e000
 remote address: LID 0x1a QPN 0x0048 PSN 0x9a1f1a RKey 0x1fff00 VAddr 0x007ff907a9f000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]    CPU_Util[%]
 65536      4149167          0.00               362.56            0.691527        0.74
---------------------------------------------------------------------------------------

NCCL test

Reference: testing NVIDIA NCCL over IB and GPU on Kubernetes
