In follow-up posts I will also verify the use of gpu-operator and network-operator.
Environment
Kubernetes: 1.30
IB driver: MLNX_OFED_LINUX-24.10-1.1.4.0
Check the driver version:
$ ofed_info -s
MLNX_OFED_LINUX-24.10-1.1.4.0
Installing k8s-rdma-shared-dev-plugin
The latest release at the time of writing should be v1.5.2; I simply used the master branch.
git clone https://github.com/Mellanox/k8s-rdma-shared-dev-plugin.git
cd k8s-rdma-shared-dev-plugin/deployment/k8s/base
Edit the ConfigMap
Check the mapping between IB devices and network interfaces:
$ ibdev2netdev
mlx5_0 port 1 ==> ibp26s0 (Up)
mlx5_1 port 1 ==> ibp60s0 (Up)
mlx5_2 port 1 ==> ibp77s0 (Up)
mlx5_3 port 1 ==> ibp94s0 (Up)
The plugin maps the node's RDMA network interfaces to Kubernetes resources. In most cases only the ifNames field needs to be adjusted to match your interface names; if possible, define more than one resourceName so you can verify communication both between different HCAs on a single node and across nodes.
apiVersion: v1
kind: ConfigMap
metadata:
  name: rdma-devices
  namespace: kube-system
data:
  config.json: |
    {
      "configList": [
        {
          "resourceName": "rdma_shared_device_a",
          "rdmaHcaMax": 63,
          "selectors": {
            "vendors": [],
            "deviceIDs": [],
            "drivers": [],
            "ifNames": ["ibp26s0","ibp60s0"],
            "linkTypes": []
          }
        },
        {
          "resourceName": "rdma_shared_device_b",
          "rdmaHcaMax": 63,
          "selectors": {
            "vendors": [],
            "deviceIDs": [],
            "drivers": [],
            "ifNames": ["ibp77s0","ibp94s0"],
            "linkTypes": []
          }
        }
      ]
    }
Install the plugin
# Label the nodes that have IB HCAs
$ kubectl label node gpu01 rdma=true --overwrite
$ pwd
k8s-rdma-shared-dev-plugin/deployment/k8s/base
$ kubectl apply -k .
Wait for the plugin pod to reach Running:
$ kubectl get po -n kube-system -l name=rdma-shared-dp-ds
NAME READY STATUS RESTARTS AGE
rdma-shared-dp-ds-cfwzj 1/1 Running 0 3h28m
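Once the plugin pod is Running, it is worth confirming that the shared RDMA resources defined in the ConfigMap are actually registered on the node; with the ConfigMap above, both rdma/rdma_shared_device_a and rdma/rdma_shared_device_b should appear under the node's capacity and allocatable with a count of 63:
$ kubectl describe node gpu01 | grep rdma/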
If the plugin logs an error mentioning pci.ids, the host's pci.ids file needs to be mounted into the container:
2025/03/07 02:40:54 Starting K8s RDMA Shared Device Plugin version= master
Using Kubelet Plugin Registry Mode
2025/03/07 02:40:54 resource manager reading configs
2025/03/07 02:40:54 Reading /k8s-rdma-shared-dev-plugin/config.json
2025/03/07 02:40:54 loaded config: [{ResourceName:hca_shared_devices_a ResourcePrefix: RdmaHcaMax:1000 Devices:[ibp26s0 ibp60s0] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[] LinkTypes:[]}} {ResourceName:hca_shared_devices_b ResourcePrefix: RdmaHcaMax:500 Devices:[] Selectors:{Vendors:[15b3] DeviceIDs:[1021] Drivers:[] IfNames:[ibp94s0 ibp77s0] LinkTypes:[]}}]
2025/03/07 02:40:54 periodic update interval: +300
2025/03/07 02:40:54 Warning: "devices" field is deprecated, it is recommended to use the new “selectors” field
2025/03/07 02:40:54 Discovering host devices
2025/03/07 02:40:54 discovering host network devices
2025/03/07 02:40:54 Error: error discovering host devices error getting PCI info: No pci-ids DB files found (and network fetch disabled)
Edit daemonset.yaml:
...
spec:
  # nodeSelector added so the DaemonSet is not scheduled onto nodes without IB HCAs
  nodeSelector:
    rdma: 'true'
  containers:
    - ...
      volumeMounts:
        ...
        - name: pci-ids
          mountPath: /usr/share/misc/pci.ids
  volumes:
    ...
    - name: pci-ids
      hostPath:
        path: /usr/share/misc/pci.ids
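After editing, re-apply the kustomization and wait for the DaemonSet (named rdma-shared-dp-ds, as seen in the pod name above) to finish rolling out:
$ kubectl apply -k .
$ kubectl -n kube-system rollout status ds/rdma-shared-dp-ds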
Test IB connectivity
The test image referenced by k8s-rdma-shared-dev-plugin is not very convenient, so a third-party image is used here instead.
pod1
# mofed-test-pod
apiVersion: v1
kind: Pod
metadata:
  name: mofed-test-pod
spec:
  restartPolicy: OnFailure
  containers:
    #- image: mellanox/rping-test
    - image: dtnaas/ofed:5.4-3
      name: mofed-test-ctr
      securityContext:
        capabilities:
          add: [ "IPC_LOCK", "SYS_RESOURCE" ]
      resources:
        limits:
          rdma/rdma_shared_device_a: 1
      command:
        - sh
        - -c
        - |
          ls -l /dev/infiniband /sys/class/infiniband /sys/class/net
          sleep 1000000
Create the pod:
kubectl create -f mofed-test-pod.yaml
pod2
# mofed-test-pod1
apiVersion: v1
kind: Pod
metadata:
  name: mofed-test-pod1
spec:
  restartPolicy: OnFailure
  containers:
    #- image: mellanox/rping-test
    - image: dtnaas/ofed:5.4-3
      name: mofed-test-ctr
      securityContext:
        capabilities:
          add: [ "IPC_LOCK", "SYS_RESOURCE" ]
      resources:
        limits:
          rdma/rdma_shared_device_b: 1 # note: a different resource than pod1
      command:
        - sh
        - -c
        - |
          ls -l /dev/infiniband /sys/class/infiniband /sys/class/net
          sleep 1000000
Create the pod:
kubectl create -f mofed-test-pod1.yaml
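Optionally confirm that both pods are Running and note their pod IPs with -o wide; the server pod's IP is what the client passes to ib_write_bw below:
$ kubectl get po mofed-test-pod mofed-test-pod1 -o wide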
IB bandwidth test
$ kubectl exec -it mofed-test-pod -- bash
# Confirm which IB devices were allocated to the container
$ ibv_devices
device node GUID
------ ----------------
mlx5_0 5c257303000c0158
mlx5_1 5c2573030006c3da
$ kubectl exec -it mofed-test-pod1 -- bash
$ ibv_devices
device node GUID
------ ----------------
mlx5_2 5c257303000c0298
mlx5_3 5c25730300019b52
Run the server in mofed-test-pod
# mlx5_0 is one of the devices visible inside this pod; use it to start the server
root@mofed-test-pod:/# ib_write_bw -d mlx5_0 -a -F
************************************
* Waiting for client to connect... *
************************************
Run the client in mofed-test-pod1
# mlx5_2 is one of the devices visible inside this pod; use it to connect to the server's mlx5_0
# the IP is the server pod's eth0 address
root@mofed-test-pod1:/# ib_write_bw -F -d mlx5_2 10.233.68.10 -D 10 --cpu_util --report_gbits
After the client starts, both the server and the client print bandwidth results like the following:
############ mofed-test-pod
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
CQ Moderation : 100
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x99 QPN 0x0061 PSN 0x6a8e46 RKey 0x1fff00 VAddr 0x007fb0cbf27000
remote address: LID 0x7f QPN 0x0047 PSN 0xd4d904 RKey 0x1fff00 VAddr 0x007fb471b5e000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 4182819 0.00 365.49 0.697124
---------------------------------------------------------------------------------------
############ mofed-test-pod1
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_2
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 1
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x7f QPN 0x0047 PSN 0xd4d904 RKey 0x1fff00 VAddr 0x007fb471b5e000
remote address: LID 0x99 QPN 0x0061 PSN 0x6a8e46 RKey 0x1fff00 VAddr 0x007fb0cbf27000
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] CPU_Util[%]
65536 4182819 0.00 365.49 0.697124 0.61
---------------------------------------------------------------------------------------
This confirms that IB communication between the pods works; the measured bandwidth is about 365 Gb/s on 400 Gb/s IB HCAs.
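The same pod pair can also be used for a quick latency check, assuming the image ships the full perftest suite (ib_write_lat comes from the same package as ib_write_bw); devices and server IP are the same as above:
# server (mofed-test-pod)
ib_write_lat -d mlx5_0 -F
# client (mofed-test-pod1)
ib_write_lat -d mlx5_2 -F 10.233.68.10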
NCCL test
Distributed model serving ultimately goes through the NCCL library to drive IB communication, so nccl-tests is used to verify cross-node GPU + IB communication.
Build a CUDA image that contains the IB tools plus nccl-tests and MPI. Pulling NVIDIA's CUDA base image from NGC may require an NGC API key.
# laiye-aifoundry-registry.cn-beijing.cr.aliyuncs.com/laiye-foundry/nvcr.io/nvidia/cuda:12.8.0-devel-ubuntu22.04
FROM nvcr.io/nvidia/cuda:12.8.0-devel-ubuntu22.04
RUN apt-get update && \
apt-get install -y --no-install-recommends \
git \
libibverbs1 \
librdmacm1 \
ibverbs-providers \
ibverbs-utils \
libibumad3 \
libmlx5-1 \
librdmacm-dev \
libibverbs-dev \
iproute2 \
net-tools \
openssh-client \
openssh-server \
infiniband-diags \
perftest \
rdma-core \
wget \
vim \
&& rm -rf /var/lib/apt/lists/*
# ENV http_proxy=http://172.16.10.116:10809
# ENV https_proxy=http://172.16.10.116:10809
# mpi
ENV OMPI_VERSION=5.0.7
RUN mkdir /tmp/openmpi && cd /tmp/openmpi && \
wget https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-${OMPI_VERSION}.tar.gz && \
tar zxf openmpi-${OMPI_VERSION}.tar.gz && \
cd openmpi-${OMPI_VERSION} && \
./configure --prefix=/usr/local \
--with-cuda=/usr/local/cuda \
--with-verbs \
--enable-mca-no-build=btl-uct \
--enable-orterun-prefix-by-default \
--without-psm \
--without-psm2 \
--without-libfabric \
--with-rdma=libibverbs && \
make -j$(nproc) && make install && \
ldconfig && \
rm -rf /tmp/openmpi
# nccl
WORKDIR /workspace
RUN git clone https://github.com/NVIDIA/nccl-tests.git && \
cd nccl-tests && \
make MPI=1 MPI_HOME=/usr/local && \
mv build/* /usr/local/bin/ && \
cd .. && rm -rf nccl-tests
ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:$LD_LIBRARY_PATH
ENV NCCL_DEBUG=INFO
Build the image:
docker build -t 172.16.10.3:5000/laiye-foundry/nccl_test:cuda12.8-ubuntu2204_00 .
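If the cluster nodes pull images from this private registry, push the freshly built tag there as well:
docker push 172.16.10.3:5000/laiye-foundry/nccl_test:cuda12.8-ubuntu2204_00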
mpirun statefulset
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mpi-nccl-cluster
spec:
  serviceName: mpi-nccl-svc
  replicas: 2 # adjust to the actual number of nodes; here: two nodes, each with 8 GPUs + 8 IB HCAs
  selector:
    matchLabels:
      app: mpi-nccl
  template:
    metadata:
      labels:
        app: mpi-nccl
    spec:
      dnsPolicy: ClusterFirstWithHostNet
      containers:
        - name: mpi-node
          image: your-registry/mpi-nccl:latest
          securityContext:
            capabilities:
              add: ["IPC_LOCK", "SYS_ADMIN", "NET_ADMIN"]
          env:
            - name: LD_LIBRARY_PATH
              value: /usr/local/lib:/usr/local/cuda/lib64:/usr/lib64/mpi/gcc/openmpi/lib64
          ports:
            - containerPort: 2222 # SSH port
          resources:
            limits:
              nvidia.com/gpu: 8 # all 8 GPUs per node, matching slots=8 in the hostfile below
              rdma/rdma_shared_device_a: 1
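The StatefulSet references a headless Service named mpi-nccl-svc; if it is not already defined elsewhere, a minimal sketch (assuming only stable DNS names for the pods are needed) looks like:
apiVersion: v1
kind: Service
metadata:
  name: mpi-nccl-svc
spec:
  clusterIP: None   # headless Service: gives each StatefulSet pod a stable DNS name
  selector:
    app: mpi-nccl
  ports:
    - name: ssh
      port: 2222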
NCCL IB + GPU test
# Generate an SSH key pair in the primary pod
ssh-keygen -t rsa
# Set a root password in both pods (passwd)
# Allow root login over SSH in both pods
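# One possible way to do this, assuming the image uses stock OpenSSH defaults
# (sshd itself comes from the openssh-server package) -- run in both pods,
# choosing your own password:
mkdir -p /run/sshd
echo "Port 2222" >> /etc/ssh/sshd_config
echo "PermitRootLogin yes" >> /etc/ssh/sshd_config
echo "root:<password>" | chpasswd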
# Start sshd
nohup /usr/sbin/sshd -D > /dev/null 2>&1 &
# Copy the public key to all pods
ssh-copy-id -i ~/.ssh/id_rsa.pub -p 2222 root@10.233.67.10
ssh-copy-id -i ~/.ssh/id_rsa.pub -p 2222 root@10.233.68.10
# Configure slots per node in the hostfile (the IPs are the two pod IPs)
cat host.txt
10.233.67.10 slots=8
10.233.68.10 slots=8
mpirun -np 16 \
--hostfile host.txt \
--allow-run-as-root \
--mca plm_rsh_args "-p 2222" \
--mca btl_tcp_if_include eth0 \
-x NCCL_SOCKET_IFNAME=eth0 \
-x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_7,mlx5_8,mlx5_9 \
-x NCCL_DEBUG=INFO \
-x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
/usr/local/bin/all_reduce_perf -b 1M -e 1G -f 2 -g 1 -c 0
Log output similar to the following indicates success.
# only the key part is shown here; at minimum it shows that NCCL picked up the IB devices
...
mpi-nccl-cluster-1:795:823 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
mpi-nccl-cluster-1:795:823 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
mpi-nccl-cluster-1:795:823 [0] NCCL INFO NCCL_IB_HCA set to mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_7,mlx5_8,mlx5_9
mpi-nccl-cluster-1:797:797 [0] NCCL INFO NCCL version 2.25.1+cuda12.8
mpi-nccl-cluster-1:792:792 [0] NCCL INFO NCCL version 2.25.1+cuda12.8
mpi-nccl-cluster-1:795:823 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_8:1/IB [6]mlx5_9:1/IB [RO]; OOB eth0:10.233.67.10<0>
mpi-nccl-cluster-1:795:823 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
mpi-nccl-cluster-1:795:823 [0] NCCL INFO Using network IB
mpi-nccl-cluster-1:791:825 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
mpi-nccl-cluster-1:791:825 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
mpi-nccl-cluster-1:791:825 [0] NCCL INFO NCCL_IB_HCA set to mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_7,mlx5_8,mlx5_9
mpi-nccl-cluster-0:1105:1133 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
mpi-nccl-cluster-0:1105:1133 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
mpi-nccl-cluster-0:1105:1133 [0] NCCL INFO NCCL_IB_HCA set to mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_7,mlx5_8,mlx5_9
mpi-nccl-cluster-1:791:825 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_8:1/IB [6]mlx5_9:1/IB [RO]; OOB eth0:10.233.67.10<0>
mpi-nccl-cluster-1:791:825 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
mpi-nccl-cluster-1:791:825 [0] NCCL INFO Using network IB
mpi-nccl-cluster-1:794:827 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
mpi-nccl-cluster-1:794:827 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
mpi-nccl-cluster-1:794:827 [0] NCCL INFO NCCL_IB_HCA set to mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_7,mlx5_8,mlx5_9
mpi-nccl-cluster-0:1105:1133 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_7:1/IB [6]mlx5_8:1/IB [7]mlx5_9:1/IB [RO]; OOB eth0:10.233.68.10<0>
mpi-nccl-cluster-0:1105:1133 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
mpi-nccl-cluster-0:1105:1133 [0] NCCL INFO Using network IB
mpi-nccl-cluster-0:1100:1135 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
mpi-nccl-cluster-0:1100:1135 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
mpi-nccl-cluster-0:1100:1135 [0] NCCL INFO NCCL_IB_HCA set to mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_7,mlx5_8,mlx5_9
mpi-nccl-cluster-1:794:827 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_8:1/IB [6]mlx5_9:1/IB [RO]; OOB eth0:10.233.67.10<0>
mpi-nccl-cluster-1:794:827 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
mpi-nccl-cluster-1:794:827 [0] NCCL INFO Using network IB
mpi-nccl-cluster-0:1100:1135 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_7:1/IB [6]mlx5_8:1/IB [7]mlx5_9:1/IB [RO]; OOB eth0:10.233.68.10<0>
mpi-nccl-cluster-0:1100:1135 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
mpi-nccl-cluster-0:1100:1135 [0] NCCL INFO Using network IB
mpi-nccl-cluster-0:1103:1137 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
mpi-nccl-cluster-0:1103:1137 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
mpi-nccl-cluster-0:1103:1137 [0] NCCL INFO NCCL_IB_HCA set to mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_7,mlx5_8,mlx5_9
mpi-nccl-cluster-0:1103:1137 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_7:1/IB [6]mlx5_8:1/IB [7]mlx5_9:1/IB [RO]; OOB eth0:10.233.68.10<0>
mpi-nccl-cluster-0:1103:1137 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
mpi-nccl-cluster-0:1103:1137 [0] NCCL INFO Using network IB
mpi-nccl-cluster-1:796:829 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
mpi-nccl-cluster-1:796:829 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
mpi-nccl-cluster-1:796:829 [0] NCCL INFO NCCL_IB_HCA set to mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_7,mlx5_8,mlx5_9
mpi-nccl-cluster-0:1106:1139 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
mpi-nccl-cluster-0:1106:1139 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
mpi-nccl-cluster-0:1106:1139 [0] NCCL INFO NCCL_IB_HCA set to mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_7,mlx5_8,mlx5_9
mpi-nccl-cluster-1:796:829 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_8:1/IB [6]mlx5_9:1/IB [RO]; OOB eth0:10.233.67.10<0>
mpi-nccl-cluster-1:796:829 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
mpi-nccl-cluster-1:796:829 [0] NCCL INFO Using network IB
mpi-nccl-cluster-0:1106:1139 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_7:1/IB [6]mlx5_8:1/IB [7]mlx5_9:1/IB [RO]; OOB eth0:10.233.68.10<0>
mpi-nccl-cluster-0:1106:1139 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
mpi-nccl-cluster-0:1106:1139 [0] NCCL INFO Using network IB
mpi-nccl-cluster-0:1107:1143 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
...
The rest of the log covers process startup and the performance results of the benchmark.