Multi-Node Setup Guide
Overview
Multi-node training or inference requires a network connection between nodes. This document describes how to set up the network for deployment across nodes.
1. Get Internal IP Addresses of Nodes
For your multi-node launch scripts, you will need the internal IP addresses of all instances.
On each instance, run the following command and read the internal IP from the inet field of the primary network interface:
apt-get install -y net-tools && ifconfig
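If net-tools is not available, the same address can usually be read with the ip command from iproute2. The sketch below is an alternative under stated assumptions, not part of the original procedure: the helper names parse_ipv4 and internal_ip are hypothetical, and eth0 is only a default borrowed from the NCCL settings later in this guide.

```shell
# Hypothetical helper: extract the IPv4 address from one line of
# `ip -4 -o addr show` output (the 4th field looks like 10.0.0.5/24).
parse_ipv4() {
  awk '{ split($4, parts, "/"); print parts[1]; exit }'
}

# Hypothetical helper: print the first IPv4 address of an interface.
# Assumes iproute2 is installed; the interface defaults to eth0 (an assumption).
internal_ip() {
  ip -4 -o addr show dev "${1:-eth0}" | parse_ipv4
}

# Usage (run on each instance):
#   internal_ip          # uses eth0
#   internal_ip ens5     # any other interface name
```

Splitting the parsing from the ip call keeps the text handling easy to check on a machine that has no matching interface.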
2. Configure InfiniBand for NCCL
Ensure that InfiniBand (IB)-related packages are installed and the required environment variables are correctly configured to enable NCCL communication over IB.
Install IB Packages
Run this on each instance:
apt-get install -y infiniband-diags perftest ibverbs-providers
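After installation, it can help to confirm that the diagnostic tools are actually on the PATH before moving on. The following is a minimal sketch: have_tools is a hypothetical helper name, and the example assumes that ibstat (from infiniband-diags) and ib_write_bw (from perftest) are the tools you want to check.

```shell
# Hypothetical helper: report which of the given commands are on PATH.
# Returns non-zero if any of them is missing.
have_tools() {
  status=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool" >&2
      status=1
    fi
  done
  return "$status"
}

# Usage (run on each instance):
#   have_tools ibstat ib_write_bw
#   ibstat    # lists local IB ports and their state
```

A port reported as Active by ibstat is a reasonable sign that the link is up, although it does not by itself prove that NCCL will use it.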
Set NCCL Environment Variables
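These variables configure NCCL to work with the IB network. NCCL_IB_GID_INDEX selects the GID index used in RoCE mode, NCCL_SOCKET_IFNAME names the IP interface NCCL uses for its socket-based traffic, and NCCL_IB_DISABLE=0 keeps the IB transport enabled. Adjust the values if your instances use a different interface name or GID index: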
export NCCL_IB_GID_INDEX=3
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=0
# Optional for debugging:
# export CUDA_LAUNCH_BLOCKING=1
# export NCCL_DEBUG=INFO
CUDA_LAUNCH_BLOCKING and NCCL_DEBUG are useful for debugging but are not required for production runs.
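Because the variables must be present in the environment of every process that is launched, a quick check before starting a job can catch a forgotten export. The sketch below is illustrative only: check_nccl_env is a hypothetical helper name, and it checks only that the three variables from this section are non-empty, not that their values suit your network.

```shell
# Hypothetical helper: verify that the NCCL variables from this guide are set.
# Prints each variable that is set and returns non-zero if any is missing.
check_nccl_env() {
  missing=0
  for name in NCCL_IB_GID_INDEX NCCL_SOCKET_IFNAME NCCL_IB_DISABLE; do
    # Indirect lookup of the variable named in $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    else
      echo "$name=$value"
    fi
  done
  return "$missing"
}

# Usage (run in the same shell that will launch the job):
#   check_nccl_env || echo "set the variables above before launching"
```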
3. Launch the Multi-Node Task
Once the steps above are complete on every instance, start your multi-node training or inference processes on each instance using your distributed launch script or framework.
Please refer to the documentation of your framework for exact launch commands. Now you are ready to take full advantage of distributed GPU performance!
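As one illustration of how the internal IPs from step 1 feed into a launch, the sketch below builds a PyTorch torchrun command line. This guide does not name a framework, so torchrun is an assumption, and every value shown (two nodes, eight GPUs per node, the address 10.0.0.5, port 29500, the script name train.py) is a placeholder rather than something taken from this document. launch_cmd is a hypothetical helper that only prints the command so it can be reviewed before running.

```shell
# Placeholder values (assumptions): replace with your own cluster details.
NNODES=2                # total number of instances
GPUS_PER_NODE=8         # processes to start per instance
MASTER_ADDR=10.0.0.5    # internal IP of the instance acting as rank 0
MASTER_PORT=29500       # any free TCP port reachable from all instances

# Hypothetical helper: print the torchrun command for a given node rank.
launch_cmd() {
  node_rank="$1"
  echo "torchrun --nnodes=$NNODES --nproc_per_node=$GPUS_PER_NODE" \
       "--node_rank=$node_rank --master_addr=$MASTER_ADDR" \
       "--master_port=$MASTER_PORT train.py"
}

# Usage: run with rank 0 on the first instance, rank 1 on the second, etc.
#   launch_cmd 0
#   eval "$(launch_cmd 0)"   # to actually start the job
```

Each instance uses the same master address and port and differs only in its node rank. Other frameworks take equivalent settings under different names.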