A follow-up post will cover the use of gpu-operator and network-operator.

Environment

Kubernetes: 1.30

IB driver version: MLNX_OFED_LINUX-24.10-1.1.4.0

Check the driver version:

$ ofed_info -s
MLNX_OFED_LINUX-24.10-1.1.4.0
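
Optionally confirm that the IB links are actually up before installing the plugin (a quick check, assuming ibstat from MLNX_OFED / infiniband-diags is available on the host):

$ ibstat | grep -E "CA '|State|Rate"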

Install k8s-rdma-shared-dev-plugin

The current release should be around v1.5.2; I used the master branch directly.

git clone https://github.com/Mellanox/k8s-rdma-shared-dev-plugin.git
cd k8s-rdma-shared-dev-plugin/deployment/k8s/base

Edit the ConfigMap

Check the mapping between IB devices and network interfaces:

$ ibdev2netdev
mlx5_0 port 1 ==> ibp26s0 (Up)
mlx5_1 port 1 ==> ibp60s0 (Up)
mlx5_2 port 1 ==> ibp77s0 (Up)
mlx5_3 port 1 ==> ibp94s0 (Up)

This maps the node's network interfaces to Kubernetes resources. Depending on your interface names, only the ifNames field needs to change; if the hardware allows, configure more than one resourceName so you can verify both communication between different HCAs on a single node and communication across nodes.

apiVersion: v1
kind: ConfigMap
metadata:
  name: rdma-devices
  namespace: kube-system
data:
  config.json: |
       {
         "configList": [
           {
             "resourceName": "rdma_shared_device_a",
             "rdmaHcaMax": 63,
             "selectors": {
               "vendors": [],
               "deviceIDs": [],
               "drivers": [],
               "ifNames": ["ibp26s0","ibp60s0"],
               "linkTypes": []
             }
           },
           {
             "resourceName": "rdma_shared_device_b",
             "rdmaHcaMax": 63,
             "selectors": {
               "vendors": [],
               "deviceIDs": [],
               "drivers": [],
               "ifNames": ["ibp77s0","ibp94s0"],
               "linkTypes": []
             }
           }
         ]
       }

Install

# label the nodes that have IB cards
$ kubectl label node gpu01 rdma=true --overwrite
$ pwd
k8s-rdma-shared-dev-plugin/deployment/k8s/base
$ kubectl apply -k .

Wait until the plugin pod is Running:

$ kubectl get po -n kube-system -l name=rdma-shared-dp-ds
NAME                      READY   STATUS    RESTARTS   AGE
rdma-shared-dp-ds-cfwzj   1/1     Running   0          3h28m
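
Once the pod is Running, the shared RDMA resources defined in the ConfigMap should appear in the node's capacity/allocatable (both resources with a count of 63, matching rdmaHcaMax):

$ kubectl describe node gpu01 | grep rdma/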

If the log reports an error related to pci.ids (as below), mount the host's pci.ids file into the container.

2025/03/07 02:40:54 Starting K8s RDMA Shared Device Plugin version= master
Using Kubelet Plugin Registry Mode
2025/03/07 02:40:54 resource manager reading configs
2025/03/07 02:40:54 Reading /k8s-rdma-shared-dev-plugin/config.json
2025/03/07 02:40:54 loaded config: [{ResourceName:hca_shared_devices_a ResourcePrefix: RdmaHcaMax:1000 Devices:[ibp26s0 ibp60s0] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[] LinkTypes:[]}} {ResourceName:hca_shared_devices_b ResourcePrefix: RdmaHcaMax:500 Devices:[] Selectors:{Vendors:[15b3] DeviceIDs:[1021] Drivers:[] IfNames:[ibp94s0 ibp77s0] LinkTypes:[]}}] 
2025/03/07 02:40:54 periodic update interval: +300 
2025/03/07 02:40:54 Warning: "devices" field is deprecated, it is recommended to use the new “selectors” field
2025/03/07 02:40:54 Discovering host devices
2025/03/07 02:40:54 discovering host network devices
2025/03/07 02:40:54 Error: error discovering host devices error getting PCI info: No pci-ids DB files found (and network fetch disabled) 

Edit daemonset.yaml

...
    spec:
      # add a nodeSelector, otherwise the DaemonSet also gets scheduled onto nodes without IB cards
      nodeSelector:
        rdma: 'true'
      containers:
      - ...
        volumeMounts:
          ...
          - name: pci-ids
            mountPath: /usr/share/misc/pci.ids
      volumes:
        ...
        - name: pci-ids
          hostPath:
            path: /usr/share/misc/pci.ids
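
After editing the DaemonSet, re-apply the kustomization and wait for the rollout (assuming the DaemonSet keeps its upstream name rdma-shared-dp-ds, matching the pod label used above):

$ kubectl apply -k .
$ kubectl -n kube-system rollout status ds/rdma-shared-dp-ds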

Test IB connectivity

The official test image for k8s-rdma-shared-dev-plugin is not very convenient, so an image from a third-party author is used here.

pod1

# mofed-test-pod
apiVersion: v1
kind: Pod
metadata:
  name: mofed-test-pod
spec:
  restartPolicy: OnFailure
  containers:
    #- image: mellanox/rping-test
  - image: dtnaas/ofed:5.4-3
    name: mofed-test-ctr
    securityContext:
      capabilities:
        add: [ "IPC_LOCK", "SYS_RESOURCE" ]
    resources:
      limits:
        rdma/rdma_shared_device_a: 1
    command:
    - sh
    - -c
    - |
      ls -l /dev/infiniband /sys/class/infiniband /sys/class/net
      sleep 1000000

Create the pod:

kubectl create -f mofed-test-pod.yaml

pod2

# mofed-test-pod1
apiVersion: v1
kind: Pod
metadata:
  name: mofed-test-pod1
spec:
  restartPolicy: OnFailure
  containers:
    #- image: mellanox/rping-test
  - image: dtnaas/ofed:5.4-3
    name: mofed-test-ctr
    securityContext:
      capabilities:
        add: [ "IPC_LOCK", "SYS_RESOURCE" ]
    resources:
      limits:
        rdma/rdma_shared_device_b: 1   # note: a different resource than pod1
    command:
    - sh
    - -c
    - |
      ls -l /dev/infiniband /sys/class/infiniband /sys/class/net
      sleep 1000000

Create the pod:

kubectl create -f mofed-test-pod1.yaml
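
Make sure both test pods are Running and scheduled onto the expected node before starting the bandwidth test:

$ kubectl get po mofed-test-pod mofed-test-pod1 -o wide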

IB test

$ kubectl exec -it mofed-test-pod -- bash
# confirm which IB devices were allocated inside the container
$ ibv_devices
    device                 node GUID
    ------              ----------------
    mlx5_0              5c257303000c0158
    mlx5_1              5c2573030006c3da

$ kubectl exec -it mofed-test-pod1 -- bash
$ ibv_devices 
    device                 node GUID
    ------              ----------------
    mlx5_2              5c257303000c0298
    mlx5_3              5c25730300019b52
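
Optionally verify inside the container that the allocated device's port is active (ibv_devinfo ships with the OFED tooling in the test image):

$ ibv_devinfo -d mlx5_2 | grep -E "state|active_mtu|link_layer"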

Run the server in mofed-test-pod

# mlx5_0 is one of the devices visible inside this pod; start the server on mlx5_0
root@mofed-test-pod:/# ib_write_bw -d mlx5_0 -a -F

************************************
* Waiting for client to connect... *
************************************

Run the client in mofed-test-pod1

# mlx5_2 is visible inside this pod; use mlx5_2 to connect to the server's mlx5_0
# the IP is the server pod's eth0 IP
root@mofed-test-pod1:/# ib_write_bw  -F -d mlx5_2 10.233.68.10  -D 10 --cpu_util --report_gbits

Once the client starts, both the server and the client print bandwidth numbers like the following:

############ mofed-test-pod

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF        Device         : mlx5_0
 Number of qps   : 1        Transport type : IB
 Connection type : RC        Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x99 QPN 0x0061 PSN 0x6a8e46 RKey 0x1fff00 VAddr 0x007fb0cbf27000
 remote address: LID 0x7f QPN 0x0047 PSN 0xd4d904 RKey 0x1fff00 VAddr 0x007fb471b5e000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      4182819          0.00               365.49            0.697124
---------------------------------------------------------------------------------------

############ mofed-test-pod1
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF        Device         : mlx5_2
 Number of qps   : 1        Transport type : IB
 Connection type : RC        Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x7f QPN 0x0047 PSN 0xd4d904 RKey 0x1fff00 VAddr 0x007fb471b5e000
 remote address: LID 0x99 QPN 0x0061 PSN 0x6a8e46 RKey 0x1fff00 VAddr 0x007fb0cbf27000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]    CPU_Util[%]
 65536      4182819          0.00               365.49            0.697124        0.61
---------------------------------------------------------------------------------------

The output above shows that IB communication works, at roughly 365 Gb/s; these are 400G IB NICs.
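
Latency can be checked the same way with ib_write_lat from perftest, reusing the same devices and server IP (a sketch):

# server, in mofed-test-pod
ib_write_lat -d mlx5_0 -F
# client, in mofed-test-pod1
ib_write_lat -d mlx5_2 -F 10.233.68.10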

NCCL test

The final distributed model deployment relies on the NCCL library to drive IB communication, so nccl-tests is used to verify GPU + IB communication across machines.

Build a CUDA image with the IB tools, NCCL, and MPI. Pulling NVIDIA's CUDA image requires an NGC key.

# laiye-aifoundry-registry.cn-beijing.cr.aliyuncs.com/laiye-foundry/nvcr.io/nvidia/cuda:12.8.0-devel-ubuntu22.04
FROM nvcr.io/nvidia/cuda:12.8.0-devel-ubuntu22.04
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    git \
    libibverbs1 \
    librdmacm1 \
    ibverbs-providers \
    ibverbs-utils \
    libibumad3 \
    libmlx5-1 \
    librdmacm-dev \
    libibverbs-dev \
    iproute2 \
    net-tools \
    openssh-client \
    infiniband-diags \
    perftest \
    rdma-core \
    wget \
    vim \
    && rm -rf /var/lib/apt/lists/*
# ENV http_proxy=http://172.16.10.116:10809
# ENV https_proxy=http://172.16.10.116:10809
# mpi
ENV OMPI_VERSION=5.0.7
RUN mkdir /tmp/openmpi && cd /tmp/openmpi && \
    wget https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-${OMPI_VERSION}.tar.gz && \
    tar zxf openmpi-${OMPI_VERSION}.tar.gz && \
    cd openmpi-${OMPI_VERSION} && \
    ./configure --prefix=/usr/local \
        --with-cuda=/usr/local/cuda \
        --with-verbs \
        --enable-mca-no-build=btl-uct \
        --enable-orterun-prefix-by-default \
        --without-psm \
        --without-psm2 \
        --without-libfabric \
        --with-rdma=libibverbs && \
    make -j$(nproc) && make install && \
    ldconfig && \
    rm -rf /tmp/openmpi
# nccl-tests: build with MPI support so all ranks launched via mpirun join a single job
WORKDIR /workspace
RUN git clone https://github.com/NVIDIA/nccl-tests.git && \
    cd nccl-tests && \
    make MPI=1 MPI_HOME=/usr/local CUDA_HOME=/usr/local/cuda && \
    mv build/* /usr/local/bin/ && \
    cd .. && rm -rf nccl-tests

ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$LD_LIBRARY_PATH
ENV NCCL_DEBUG=INFO

Build the image:

docker build -t 172.16.10.3:5000/laiye-foundry/nccl_test:cuda12.8-ubuntu2204_00 .
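
Push the image to the private registry so both nodes can pull it (assuming 172.16.10.3:5000 is reachable from the cluster and, if it serves plain HTTP, is configured as an insecure registry):

docker push 172.16.10.3:5000/laiye-foundry/nccl_test:cuda12.8-ubuntu2204_00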

mpirun StatefulSet

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mpi-nccl-cluster
spec:
  serviceName: mpi-nccl-svc
  replicas: 2  # adjust to the actual node count; here: two machines, each with 8 GPUs and 8 IB cards
  selector:
    matchLabels:
      app: mpi-nccl
  template:
    metadata:
      labels:
        app: mpi-nccl
    spec:
      dnsPolicy: ClusterFirstWithHostNet
      containers:
      - name: mpi-node
        image: your-registry/mpi-nccl:latest   # the nccl_test image built above
        # keep the container running; sshd is started manually in the next step
        command: ["sleep", "infinity"]
        securityContext:
          capabilities:
            add: ["IPC_LOCK", "SYS_ADMIN", "NET_ADMIN"]
        env:
        - name: LD_LIBRARY_PATH
          value: /usr/local/lib:/usr/local/cuda/lib64:/usr/lib64/mpi/gcc/openmpi/lib64
        ports:
        - containerPort: 2222  # SSH port
        resources:
          limits:
            nvidia.com/gpu: 8                  # 8 GPUs per pod, matching slots=8 in the hostfile below
            rdma/rdma_shared_device_a: 1
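
The StatefulSet references serviceName: mpi-nccl-svc, which is not shown above; a minimal headless Service sketch that matches it (the name and port are assumptions taken from the manifest above):

apiVersion: v1
kind: Service
metadata:
  name: mpi-nccl-svc
spec:
  clusterIP: None        # headless: gives each pod a stable DNS name
  selector:
    app: mpi-nccl
  ports:
  - name: ssh
    port: 2222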

NCCL IB + GPU test

# generate an ssh key pair on the first pod
ssh-keygen -t rsa
# set a root password in both pods (passwd)
# allow root login over ssh in both pods
# (one way to do these two steps is sketched after the mpirun command below)
# start sshd
nohup /usr/sbin/sshd -D > /dev/null 2>&1  &
# copy the public key to all pods
ssh-copy-id -i ~/.ssh/id_rsa.pub -p 2222 root@10.233.67.10
ssh-copy-id -i ~/.ssh/id_rsa.pub -p 2222 root@10.233.68.10
# configure the slots per node
cat host.txt
10.233.67.10 slots=8
10.233.68.10 slots=8
mpirun -np 16 \
       --hostfile host.txt \
       --allow-run-as-root \
       --mca plm_rsh_args "-p 2222" \
       --mca btl_tcp_if_include eth0 \
       -x NCCL_SOCKET_IFNAME=eth0 \
       -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_7,mlx5_8,mlx5_9 \
       -x NCCL_DEBUG=INFO \
       -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
       /usr/local/bin/all_reduce_perf -b 1M -e 1G -f 2 -g 1 -c 0
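
The "set a root password / allow root login" steps above are only described in comments; one way to do them in each pod before starting sshd is sketched below (assuming the Ubuntu-based image built earlier, which does not include openssh-server, so it has to be installed first or baked into the image; YOUR_PASSWORD is a placeholder):

apt-get update && apt-get install -y openssh-server
mkdir -p /run/sshd
echo 'root:YOUR_PASSWORD' | chpasswd
sed -i 's/^#\?PermitRootLogin.*/PermitRootLogin yes/' /etc/ssh/sshd_config
echo 'Port 2222' >> /etc/ssh/sshd_config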

Log output similar to the following indicates success.

# only the key part is shown here; it at least shows that NCCL picked up the IB devices
...
mpi-nccl-cluster-1:795:823 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
mpi-nccl-cluster-1:795:823 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
mpi-nccl-cluster-1:795:823 [0] NCCL INFO NCCL_IB_HCA set to mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_7,mlx5_8,mlx5_9
mpi-nccl-cluster-1:797:797 [0] NCCL INFO NCCL version 2.25.1+cuda12.8
mpi-nccl-cluster-1:792:792 [0] NCCL INFO NCCL version 2.25.1+cuda12.8
mpi-nccl-cluster-1:795:823 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_8:1/IB [6]mlx5_9:1/IB [RO]; OOB eth0:10.233.67.10<0>
mpi-nccl-cluster-1:795:823 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
mpi-nccl-cluster-1:795:823 [0] NCCL INFO Using network IB
mpi-nccl-cluster-1:791:825 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
mpi-nccl-cluster-1:791:825 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
mpi-nccl-cluster-1:791:825 [0] NCCL INFO NCCL_IB_HCA set to mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_7,mlx5_8,mlx5_9
mpi-nccl-cluster-0:1105:1133 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
mpi-nccl-cluster-0:1105:1133 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
mpi-nccl-cluster-0:1105:1133 [0] NCCL INFO NCCL_IB_HCA set to mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_7,mlx5_8,mlx5_9
mpi-nccl-cluster-1:791:825 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_8:1/IB [6]mlx5_9:1/IB [RO]; OOB eth0:10.233.67.10<0>
mpi-nccl-cluster-1:791:825 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
mpi-nccl-cluster-1:791:825 [0] NCCL INFO Using network IB
mpi-nccl-cluster-1:794:827 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
mpi-nccl-cluster-1:794:827 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
mpi-nccl-cluster-1:794:827 [0] NCCL INFO NCCL_IB_HCA set to mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_7,mlx5_8,mlx5_9
mpi-nccl-cluster-0:1105:1133 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_7:1/IB [6]mlx5_8:1/IB [7]mlx5_9:1/IB [RO]; OOB eth0:10.233.68.10<0>
mpi-nccl-cluster-0:1105:1133 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
mpi-nccl-cluster-0:1105:1133 [0] NCCL INFO Using network IB
mpi-nccl-cluster-0:1100:1135 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
mpi-nccl-cluster-0:1100:1135 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
mpi-nccl-cluster-0:1100:1135 [0] NCCL INFO NCCL_IB_HCA set to mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_7,mlx5_8,mlx5_9
mpi-nccl-cluster-1:794:827 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_8:1/IB [6]mlx5_9:1/IB [RO]; OOB eth0:10.233.67.10<0>
mpi-nccl-cluster-1:794:827 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
mpi-nccl-cluster-1:794:827 [0] NCCL INFO Using network IB
mpi-nccl-cluster-0:1100:1135 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_7:1/IB [6]mlx5_8:1/IB [7]mlx5_9:1/IB [RO]; OOB eth0:10.233.68.10<0>
mpi-nccl-cluster-0:1100:1135 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
mpi-nccl-cluster-0:1100:1135 [0] NCCL INFO Using network IB
mpi-nccl-cluster-0:1103:1137 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
mpi-nccl-cluster-0:1103:1137 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
mpi-nccl-cluster-0:1103:1137 [0] NCCL INFO NCCL_IB_HCA set to mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_7,mlx5_8,mlx5_9
mpi-nccl-cluster-0:1103:1137 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_7:1/IB [6]mlx5_8:1/IB [7]mlx5_9:1/IB [RO]; OOB eth0:10.233.68.10<0>
mpi-nccl-cluster-0:1103:1137 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
mpi-nccl-cluster-0:1103:1137 [0] NCCL INFO Using network IB
mpi-nccl-cluster-1:796:829 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
mpi-nccl-cluster-1:796:829 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
mpi-nccl-cluster-1:796:829 [0] NCCL INFO NCCL_IB_HCA set to mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_7,mlx5_8,mlx5_9
mpi-nccl-cluster-0:1106:1139 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
mpi-nccl-cluster-0:1106:1139 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
mpi-nccl-cluster-0:1106:1139 [0] NCCL INFO NCCL_IB_HCA set to mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_7,mlx5_8,mlx5_9
mpi-nccl-cluster-1:796:829 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_8:1/IB [6]mlx5_9:1/IB [RO]; OOB eth0:10.233.67.10<0>
mpi-nccl-cluster-1:796:829 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
mpi-nccl-cluster-1:796:829 [0] NCCL INFO Using network IB
mpi-nccl-cluster-0:1106:1139 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_7:1/IB [6]mlx5_8:1/IB [7]mlx5_9:1/IB [RO]; OOB eth0:10.233.68.10<0>
mpi-nccl-cluster-0:1106:1139 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
mpi-nccl-cluster-0:1106:1139 [0] NCCL INFO Using network IB
mpi-nccl-cluster-0:1107:1143 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
...

The rest of the log covers process startup and the performance results of the test.
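
As a quick sanity check that the traffic really goes over IB, the same job can be re-run with IB disabled (NCCL_IB_DISABLE=1 makes NCCL fall back to sockets) and the reported bus bandwidth compared:

mpirun -np 16 \
       --hostfile host.txt \
       --allow-run-as-root \
       --mca plm_rsh_args "-p 2222" \
       -x NCCL_SOCKET_IFNAME=eth0 \
       -x NCCL_IB_DISABLE=1 \
       /usr/local/bin/all_reduce_perf -b 1M -e 1G -f 2 -g 1 -c 0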
