Multi-Node Setup Guide

Overview

Multi-node training and inference require network connectivity between all participating nodes. This guide describes how to collect each node's internal IP address, configure NCCL to communicate over InfiniBand, and launch the job across nodes.

1. Get Internal IP Addresses of Nodes

For your multi-node launch scripts, you will need the internal IP addresses of all instances.

On each instance, run the following command and note the inet address of the primary network interface (for example, eth0):

apt-get update && apt-get install -y net-tools && ifconfig
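
If net-tools is not available, the same addresses can usually be read with the ip command from iproute2. The following is a minimal sketch under that assumption; parse_ipv4 and list_internal_ips are hypothetical helper names introduced here for illustration.

```shell
# Read `ip -o -4 addr show` output on stdin and print "<interface> <ipv4>"
# for each line, dropping the /prefix length from the address.
parse_ipv4() {
  awk '{ split($4, addr, "/"); print $2, addr[1] }'
}

# List the global-scope IPv4 addresses on this instance.
list_internal_ips() {
  ip -o -4 addr show scope global | parse_ipv4
}

# Only run the live query when iproute2 is present.
if command -v ip >/dev/null 2>&1; then
  list_internal_ips
fi
```

Which of the listed addresses is the "internal" one depends on your provider's network layout; it is typically the one on the interface you later pass to NCCL_SOCKET_IFNAME.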

2. Configure InfiniBand for NCCL

Ensure that InfiniBand (IB)-related packages are installed and the required environment variables are correctly configured to enable NCCL communication over IB.

Install IB Packages

Run this on each instance:

apt-get install -y infiniband-diags perftest ibverbs-providers
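
After installation, it can help to confirm that at least one IB port is up before configuring NCCL. The sketch below assumes that ibstat (shipped in infiniband-diags) reports each port with a "State: Active" line when the link is up; count_active_ports is a hypothetical helper name.

```shell
# Count ports reported as active in `ibstat` output read from stdin.
# "Physical state:" lines use a lowercase "s", so they are not counted.
count_active_ports() {
  grep -c 'State: Active' || true
}

# Only query the hardware when ibstat is installed.
if command -v ibstat >/dev/null 2>&1; then
  echo "Active IB ports: $(ibstat 2>/dev/null | count_active_ports)"
else
  echo "ibstat not found; install infiniband-diags first" >&2
fi
```

A count of 0 suggests the IB devices are not visible on the instance, in which case NCCL would not be able to use the IB transport.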

Set NCCL Environment Variables

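These variables configure NCCL to work with the IB network. NCCL_IB_GID_INDEX selects the GID index used in RoCE mode, NCCL_SOCKET_IFNAME names the IP interface NCCL uses for its socket communication (replace eth0 if your interface is named differently), and NCCL_IB_DISABLE=0 keeps the IB transport enabled: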

export NCCL_IB_GID_INDEX=3
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=0

# Optional for debugging:
# export CUDA_LAUNCH_BLOCKING=1
# export NCCL_DEBUG=INFO

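CUDA_LAUNCH_BLOCKING=1 makes CUDA kernel launches synchronous, and NCCL_DEBUG=INFO makes NCCL log its initialization and transport selection. Both are useful for debugging but are not required for production runs; CUDA_LAUNCH_BLOCKING in particular slows execution.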
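
Because the interface name can differ between instances, one option is to generate the exports from a single parameter and check that the interface exists before launching. This is a sketch only; nccl_env and check_iface are hypothetical helpers, and the values are the ones listed in this guide.

```shell
# Print the NCCL exports from this guide for a given socket interface.
# Defaults to eth0, the interface name used in the guide.
nccl_env() {
  printf 'export NCCL_IB_GID_INDEX=3\n'
  printf 'export NCCL_SOCKET_IFNAME=%s\n' "${1:-eth0}"
  printf 'export NCCL_IB_DISABLE=0\n'
}

# Succeed if the named network interface exists on this Linux host.
check_iface() {
  [ -e "/sys/class/net/$1" ]
}

# Apply the settings to the current shell and warn on a missing interface.
eval "$(nccl_env eth0)"
check_iface "$NCCL_SOCKET_IFNAME" ||
  echo "warning: interface $NCCL_SOCKET_IFNAME not found on this host" >&2
```

Sourcing the same file on every instance keeps the settings consistent across nodes.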

3. Launch the Multi-Node Task

Once the steps above are complete, start your multi-node training or inference processes on each instance using your distributed launch script or framework.

Please refer to the documentation of your framework for exact launch commands.
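
As one illustration, the sketch below builds a per-node launch command, assuming PyTorch's torchrun as the launcher. This guide does not prescribe a framework, so treat this as an example only. The IP address, port, node count, GPU count, and the script name train.py are all hypothetical placeholders, and build_launch_cmd is a helper introduced here.

```shell
# Hypothetical cluster values; replace with your own.
MASTER_ADDR="10.0.0.5"   # internal IP of the rank-0 node (from step 1)
MASTER_PORT=29500        # any free TCP port reachable from all nodes
NNODES=2                 # total number of instances
GPUS_PER_NODE=8          # processes to start on each instance

# Print the torchrun command for the node with the given rank ($1).
build_launch_cmd() {
  printf 'torchrun --nnodes=%s --nproc_per_node=%s --node_rank=%s --master_addr=%s --master_port=%s train.py' \
    "$NNODES" "$GPUS_PER_NODE" "$1" "$MASTER_ADDR" "$MASTER_PORT"
}

# Show the command for rank 0; run the matching rank on each instance.
build_launch_cmd 0
echo
```

Each instance runs the same command with its own node rank, while the master address and port stay identical on all nodes.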
