
Most cost-effective solutions for inference, fine-tuning and pretraining, tailored to LLaMA3 series

Meta's release of LLaMA3 8B and LLaMA3 70B is a significant advance for open-source large AI models. Both models perform strongly across a range of benchmarks, making them a leading choice for AI applications.
Colossal-AI now provides comprehensive solutions for inference, fine-tuning, and pretraining specifically tailored to the LLaMA3 8B and LLaMA3 70B models.
On 64 H100 GPUs, Colossal-AI trains the LLaMA3 70B model nearly 20% more efficiently than competing solutions such as Megatron+DeepSpeed. For inference, it also outperforms alternative solutions such as vLLM.
Colossal-AI open source address:


Colossal-Inference now supports LLaMA-3 inference acceleration.
Figure: Benchmark results for LLaMA3-8B on one H100 GPU, compared with vLLM (example: input sequence length 128, output length 256)

LLaMA3 Generation

Run the generation script
# huggingface or local model path
cd ColossalAI/examples/inference/
colossalai run --nproc_per_node 1 -m $PRETRAINED_MODEL_PATH --max_length 80
Run multi-GPU inference (Tensor Parallelism), as in the following example using 2 GPUs.
colossalai run --nproc_per_node 2 -m $PRETRAINED_MODEL_PATH --max_length 80 --tp_size 2
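To make the idea behind `--tp_size 2` concrete, here is a minimal pure-Python sketch of column-wise tensor parallelism for a single linear layer. It is not Colossal-AI's implementation: the helper names are hypothetical, the two "GPUs" are simulated with plain lists, and list concatenation stands in for the all-gather that a real system would perform.

```python
def linear(x, weight):
    """Compute y[j] = sum_i x[i] * weight[i][j]."""
    return [sum(xi * row[j] for xi, row in zip(x, weight))
            for j in range(len(weight[0]))]

def split_columns(weight, parts):
    """Split the output columns of `weight` into `parts` equal shards."""
    cols = len(weight[0])
    if cols % parts != 0:
        raise ValueError("column count must be divisible by the number of shards")
    step = cols // parts
    return [[row[p * step:(p + 1) * step] for row in weight] for p in range(parts)]

x = [1, 2]
weight = [[1, 2, 3, 4],
          [5, 6, 7, 8]]

shards = split_columns(weight, 2)                 # one shard per simulated GPU
partial = [linear(x, shard) for shard in shards]  # each GPU computes half of the outputs
gathered = partial[0] + partial[1]                # stands in for an all-gather

print(gathered)                       # [11, 14, 17, 20]
print(gathered == linear(x, weight))  # True
```

Each shard holds half of the weight columns, so per-device weight memory is halved while the combined result matches the unsplit layer.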

Throughput Scripts

Run the throughput Benchmark test
git pull # update example benchmark from branch feature/colossal-infer
cd ColossalAI/examples/inference/
python -m llama3-8b -b 32 -s 128 -o 256 -p $PRETRAINED_MODEL_PATH
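When reading benchmark output, it helps to know how a generation throughput figure is typically derived. The sketch below assumes that `-b`, `-s` and `-o` stand for batch size, input length and output length (consistent with the example in the figure caption, but not confirmed by the script itself) and that throughput counts generated tokens per second. The function name and the latency value are hypothetical.

```python
def generation_throughput(batch_size: int, output_len: int, latency_s: float) -> float:
    """Generated tokens per second for one batched generation call."""
    if latency_s <= 0:
        raise ValueError("latency_s must be positive")
    return batch_size * output_len / latency_s

# Hypothetical timing, only to show the arithmetic:
# 32 sequences x 256 new tokens finished in 4 seconds.
print(generation_throughput(32, 256, 4.0))  # 2048.0
```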

Finetuning and Continual Pretraining

Preliminary Preparation

  1. Install
git clone
cd flash-attention
# At flash-attention root folder; each subshell returns to the root after installing
(cd csrc/xentropy && pip install .)
(cd csrc/layer_norm && pip install .)
(cd csrc/rotary && pip install .)

# At local root folder, same level as flash attention
  2. Create folders required for training
# At local root folder, same level as flash attention, ColossalAI
mkdir training_outputs
mkdir training_outputs/checkpoints
mkdir training_outputs/configs
mkdir training_outputs/tensorboards
  3. Modify hostfile
cd ColossalAI/applications/Colossal-LLaMA/
cp hostfile.example hostfile
# update hostfile

Data Preparation

  1. Continual Pretraining
{"source": "", "target": "Lionel Andrés Messi(Spanish pronunciation: [ljoˈnel anˈdɾes ˈmesi] (i); born 24 June 1987), also known as Leo Messi, is an Argentine professional footballer who plays as a forward for and captains both Major League Soccer club Inter Miami and the Argentina national team.", "category": "sports"}
{"source": "猜谜语:一身卷卷细毛,吃的青青野草,过了数九寒冬,无私献出白毛。(打一动物)", "target": "白羊", "category": "riddle"}
python \
--data_input_dirs "Directory that contains one or more JSONL file(s)" \
--tokenizer_dir "Tokenizer directory" \
--data_output_dirs "Spliced tokenized output" \
--max_length 8192 \
--num_spliced_dataset_bins 10
  2. Finetuning
{"messages": [{"from": "human", "content": "What are the three primary colors?"}, {"from": "assistant", "content": "The three primary colors are red, blue, and yellow."}]}
{"messages": [{"from": "human", "content": "解释个人电脑和服务器之间的区别。"}, {"from": "assistant", "content": "个人电脑和服务器是两种不同类型的计算机系统,它们的主要区别在于用途、硬件配置和性能。 个人电脑,顾名思义,是为个人使用而设计的计算机。它们通常用于日常的工作、娱乐和学习,可以运行各种各样的应用程序和游戏。个人电脑的硬件配置一般是按照标准配置来设计的,不过也可以根据个人需求进行定制。 而服务器是为了满足大量用户的需求而设计的计算机系统,它们通常用于为用户提供各种网络服务,如网站、电子邮件和文件传输等。服务器通常需要高性能的硬件配置,并且可以承受高负载和长时间的运行。由于服务器需要支持大量用户的访问,它们通常配备多核处理器、大容量内存和大容量硬盘驱动器,以提高系统的运行速度和稳定性。 总之,个人电脑和服务器之间的主要区别在于它们的用途、硬件配置和性能。个人电脑用于个人使用,而服务器用于支持大量用户的访问。服务器的硬件配置通常比个人电脑更高,以保证系统的性能和稳定性。"}]}
python \
--data_input_dirs "Directory that contains one or more JSONL file(s)" \
--tokenizer_dir "Tokenizer directory" \
--data_output_dirs "Spliced tokenized output" \
--max_length 8192 \
--num_spliced_dataset_bins 10 \
--llama_version 3
When the command completes successfully, it creates three subfolders inside the data_output_dirs directory. The contents of the arrow folder can be used directly for training.
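Before running either tokenization command, it can be useful to sanity-check the JSONL input. The sketch below is not part of Colossal-AI. The field names are taken from the two samples above, while the function names and the rule that turns alternate human / assistant, starting with human, are assumptions made for illustration.

```python
import json

PRETRAIN_KEYS = {"source", "target", "category"}

def check_pretrain_record(line: str) -> bool:
    """True if the JSONL line has the three string fields shown in the pretraining sample."""
    record = json.loads(line)
    return (
        isinstance(record, dict)
        and PRETRAIN_KEYS.issubset(record)
        and all(isinstance(record[key], str) for key in PRETRAIN_KEYS)
    )

def check_sft_record(line: str) -> bool:
    """True if the line holds a non-empty `messages` list of {"from", "content"} turns
    that alternates human / assistant, starting with human (an assumed convention)."""
    record = json.loads(line)
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return False
    expected = ("human", "assistant")
    for i, turn in enumerate(messages):
        if not isinstance(turn, dict):
            return False
        if turn.get("from") != expected[i % 2] or not isinstance(turn.get("content"), str):
            return False
    return True

pretrain_line = '{"source": "", "target": "白羊", "category": "riddle"}'
sft_line = ('{"messages": [{"from": "human", "content": "Hi"}, '
            '{"from": "assistant", "content": "Hello"}]}')
print(check_pretrain_record(pretrain_line), check_sft_record(sft_line))  # True True
```

Running such a check over every line of a file catches missing fields early, before a long tokenization job starts.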

Training Scripts

Current path:/root/ColossalAI/ColossalAI/applications/Colossal-LLaMA/
  1. Modify the config file
#Update training scripts
  2. Training scripts for reference
# xxx is your local path to the directory that contains ColossalAI and training_outputs
PARENT_SAVE_DIR="xxx/training_outputs/checkpoints" # Path to a folder to save checkpoints
PARENT_TENSORBOARD_DIR="xxx/training_outputs/tensorboards" # Path to a folder to save logs
PARENT_CONFIG_FILE="xxx/training_outputs/configs" # Path to a folder to save training config logs
PRETRAINED_MODEL_PATH="" # huggingface or local model path

# Taking the pre-set processed dataset as an example
# xxx is your local path to the directory that contains the processed arrow folders
declare -a dataset=(
    # add the path(s) to your processed arrow folders here, one per line
)

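TIMESTAMP=$(date +%Y-%m-%d-%H-%M-%S)
# Assumption: the per-run paths used below are not defined in this excerpt;
# one way to derive them from the parent folders and the timestamp:
SAVE_DIR="${PARENT_SAVE_DIR}/${TIMESTAMP}"
TENSORBOARD_DIR="${PARENT_TENSORBOARD_DIR}/${TIMESTAMP}"
CONFIG_FILE="${PARENT_CONFIG_FILE}/${TIMESTAMP}.json"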

colossalai run --nproc_per_node 8 --hostfile hostfile --master_port 31312 \
--dataset ${dataset[@]} \
--plugin "zero2" \
--save_interval 400 \
--save_dir $SAVE_DIR \
--tensorboard_dir $TENSORBOARD_DIR \
--config_file $CONFIG_FILE \
--num_epochs 1 \
--micro_batch_size 2 \
--lr 1e-4 \
--mixed_precision "bf16" \
--grad_clip 1.0 \
--weight_decay 0.01 \
--warmup_steps 100 \
--use_grad_checkpoint \
--use_flash_attn
More details can be found at:
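To relate the flags in the training command to the amount of data consumed per optimizer step, the following sketch works through the arithmetic. It assumes that the `zero2` plugin is purely data parallel (every GPU processes its own micro-batch), that there is no gradient accumulation unless stated, and that the hostfile lists a single node. The function name is hypothetical.

```python
def global_batch_size(micro_batch_size: int, gpus_per_node: int, num_nodes: int,
                      grad_accum_steps: int = 1) -> int:
    """Sequences consumed per optimizer step under pure data parallelism."""
    return micro_batch_size * gpus_per_node * num_nodes * grad_accum_steps

# --micro_batch_size 2 and --nproc_per_node 8 come from the command above;
# the single node is an assumption made for this example.
sequences = global_batch_size(2, 8, 1)
print(sequences)         # 16
print(sequences * 8192)  # 131072 tokens per step if every sequence is packed to 8192
```

With more nodes in the hostfile, the global batch grows proportionally, so the learning rate and warmup settings may need to be revisited.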

Large-scale training

For large-scale pretraining, we optimize 3D hybrid parallelism around the characteristics of LLaMA3, including its longer sequence length and larger embedding. By tuning the pipeline partitioning and gradient checkpointing strategies, we gain finer control over GPU memory consumption and execution speed, which improves overall training efficiency.
Using integer linear programming, we searched for the optimal partitioning and gradient checkpointing strategy for LLaMA3-70B on 64 H100 GPUs. The resulting configuration exceeds 410 TFLOPS per GPU.
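To give a feel for the kind of constrained search described here, the toy sketch below picks, for each pipeline stage, the smallest number of layers to checkpoint so that activations fit a memory budget. It is not the formulation used by Colossal-AI: it solves a tiny integer program by plain enumeration instead of an ILP solver, all numbers are hypothetical memory units rather than measurements, and it assumes a 1F1B-style schedule in which stage i of 4 keeps (4 - i) micro-batches in flight.

```python
def min_checkpointed_layers(num_layers, inflight_microbatches, act_full, act_ckpt, budget):
    """Smallest number of layers to checkpoint so activations fit in `budget`.

    Solves by enumeration:
        minimise k
        subject to inflight * ((num_layers - k) * act_full + k * act_ckpt) <= budget
    Returns None if even checkpointing every layer does not fit.
    """
    for k in range(num_layers + 1):
        memory = inflight_microbatches * ((num_layers - k) * act_full + k * act_ckpt)
        if memory <= budget:
            return k
    return None

# Hypothetical setup: 4 stages of 20 layers, a layer costs 10 units of activation
# memory per micro-batch when stored in full and 1 unit when checkpointed,
# and each stage has a budget of 400 units.
plan = [min_checkpointed_layers(20, 4 - stage, 10, 1, 400) for stage in range(4)]
print(plan)  # [12, 8, 0, 0]
```

Under these assumptions the early stages, which hold more in-flight micro-batches, need more checkpointing than the late ones, which is why a per-stage strategy can beat a uniform one.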
More details can be found at:
This example includes the configuration we used during testing.
git clone
cd ColossalAI/examples/language/llama
BUILD_EXT=1 pip install -U git+
pip install -r requirements.txt
export PYTHONPATH=$(realpath ..)
colossalai run --nproc_per_node 8 --hostfile HOSTFILE -c Meta-Llama-3-70B -x -g -p 3d --tp 4 --pp 4 --zero 1 -l 8192 --mbs 2 -b 128 --custom-ckpt
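The parallel layout implied by the command can be worked out with simple arithmetic. The sketch below assumes that `--tp` and `--pp` are the tensor and pipeline parallel sizes and that the job runs on the 64 GPUs mentioned in the text. The function name is hypothetical.

```python
def data_parallel_size(world_size: int, tp: int, pp: int) -> int:
    """Number of model replicas left after tensor and pipeline parallelism."""
    if world_size % (tp * pp) != 0:
        raise ValueError("world size must be divisible by tp * pp")
    return world_size // (tp * pp)

dp = data_parallel_size(64, tp=4, pp=4)
print(dp)  # 4

# Aggregate compute at the reported per-GPU figure of 410 TFLOPS, in PFLOPS.
print(64 * 410 / 1000)  # 26.24
```

Each model replica therefore spans 16 GPUs (4-way tensor parallel times 4 pipeline stages), and 4 such replicas train in parallel.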