Large AI Models Inference Speed Doubled, Colossal-Inference Open Source Release

Large AI models have received unprecedented attention in recent years, and have had a profound impact in various application scenarios. Correspondingly, the demand for efficient and highly available large model inference systems is gradually growing, becoming a core challenge for many enterprises.
HPC AI Tech has built an efficient and easy-to-use inference system, Colossal-Inference, which can significantly increase the large AI model inference speed to cope with performance bottlenecks and cost challenges in inference scenarios. The inference engine integrates chunked memory management with paged attention algorithms, preset and custom model optimization strategies, and continuous batch scheduling. The preset strategies, or policies described below, provide high-performance customized operators and third-party acceleration libraries; besides the preset policies, users can build their own custom policies to optimize the specified model by using the base operators and model layers.

Performance Comparison

On a single NVIDIA H100 80GB SXM GPU, for the LLaMA3 8B model, inference throughput can be improved by up to 40% compared to the offline inference performance of vLLM.
For multi-GPU parallel inference, taking 2-way tensor parallelism on NVIDIA H100 80GB SXM GPUs as an example, inference throughput for the LLaMA3 8B model can be nearly doubled in some cases compared to the offline inference performance of vLLM.

Colossal-Inference Flow

Initialization of Inference Engine

The user provides a model (a model path or a HuggingFace model card) together with the configuration used to initialize the engine, including general settings and limits, inference engine settings, multi-GPU parallel configuration, feature configuration, and preset generation configs. During initialization, the input model is sharded and optimized by ColossalAI.ShardFormer; for multi-GPU usage, the sharded model is distributed to the corresponding GPUs in a tensor-parallel manner. Custom optimized operators can be imported via the ColossalAI.Extension module. Once sharding and optimization are complete, the inference engine retains the optimized model. Meanwhile, the system pre-allocates activation caches on the device(s) and initializes the request scheduler, the KV cache manager, and other functional components.

Model Optimization Complete, Ready for Requests

After initialization, the system enters a state where it is waiting to receive user requests. When the user enters a prompt, such as "What are some good places to go in summer?", the inference engine scheduler performs scheduling and uses the optimized model for inference. The inference system selectively executes component modules such as KVCache quantization and speculative decoding, depending on configurations. Finally, the system returns the generated result to the user. For example, "...A good to go place is the beach. A The beach is a great place to relax and soak up the sun. It is also a good place to go for a swim or to play in the water. Some other good places to go in the summer include amusement parks, water parks, and outdoor concerts..." (example generation)

Colossal-Inference Optimization Capabilities

Large language model inference systems face a number of challenges, including but not limited to the cost of computational resources, long waiting latency, and degraded inference speed. Solving these problems calls for a more efficient and flexible approach to inference optimization, one that can be tuned for different types of inference tasks, models, and hardware devices in order to significantly increase inference speed and make effective use of computing resources. Colossal-Inference integrates the following cutting-edge techniques, all of which are easy to enable.

Tensor Parallelism

The ColossalAI.Shardformer module shards the model and replaces operators or model layers with optimized ones during the initialization phase, then transfers the sharded model weights to the corresponding devices according to the parallel configuration. The custom optimized operators are implemented in CUDA and Triton and have been adapted to the blocked KV cache described below. The custom model layers are assembled in Python, giving users flexible options for custom optimization.
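
To make the idea of sharding a layer across devices concrete, here is a minimal sketch of a column-parallel linear layer. It is not ShardFormer's API: plain Python lists stand in for tensors, each list entry in `shards` stands in for one GPU, and all function names are hypothetical.

```python
# Toy column-parallel linear layer. Each "rank" holds a slice of the output
# columns of W, computes its partial result, and the slices are concatenated.
# Illustrative only; names and structure are assumptions, not ShardFormer code.

def matmul(x, w):
    """Multiply an [m, k] matrix by a [k, n] matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def shard_columns(w, tp_size):
    """Split a [k, n] weight into tp_size shards along the output dimension."""
    n = len(w[0])
    assert n % tp_size == 0, "output dim must divide evenly across ranks"
    step = n // tp_size
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(tp_size)]

def column_parallel_linear(x, shards):
    """Run every shard independently, then concatenate along the columns."""
    partials = [matmul(x, shard) for shard in shards]  # one partial per rank
    return [sum((p[i] for p in partials), []) for i in range(len(x))]

x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0, 2.0, -1.0], [0.5, 1.0, 0.0, 3.0]]

full = matmul(x, w)
sharded = column_parallel_linear(x, shard_columns(w, tp_size=2))
print(full == sharded)
```

Each rank only needs its own slice of the weight, which is why the sharded weights can be placed on separate devices; in a real system the final concatenation is a collective communication step rather than a list operation.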

Blocked Key Value Cache

Inspired by paging in operating systems, the PagedAttention [1] algorithm brings the idea of virtual-memory paging to large language model inference. We implemented paged attention with a block-based KV cache in dedicated CUDA and Triton kernels, which achieves high performance. The KV cache of each sequence is divided into blocks, and each block holds the keys and values of a fixed number of tokens. During attention computation, the PagedAttention kernels efficiently locate and fetch the keys and values stored in these blocks. Because blocks do not need to be contiguous in memory, keys and values can be managed flexibly, much like virtual memory in an operating system: contiguous logical blocks are mapped to non-contiguous physical blocks through block tables, and physical blocks are allocated on demand as new tokens are generated. Memory waste is therefore mostly confined to the last block of each sequence. This design comes close to optimal memory usage, improving GPU utilization and significantly increasing throughput.
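
The block-table bookkeeping can be sketched in a few lines of Python. This is a simplified illustration of the idea, not the Colossal-Inference KV cache manager; the class and method names are hypothetical.

```python
# Minimal block-table sketch: logical blocks of a sequence map to physical
# blocks that are allocated on demand. All names here are illustrative.

class BlockAllocator:
    """Hands out fixed-size physical blocks from a free list."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks - 1, -1, -1))  # pop() yields 0, 1, 2, ...

    def allocate(self):
        if not self.free:
            raise RuntimeError("out of KV cache blocks")
        return self.free.pop()


class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        """Reserve a slot for one new token; return (physical_block, offset)."""
        bs = self.allocator.block_size
        if self.num_tokens == len(self.block_table) * bs:  # last block is full
            self.block_table.append(self.allocator.allocate())
        slot = (self.block_table[self.num_tokens // bs], self.num_tokens % bs)
        self.num_tokens += 1
        return slot

    def wasted_slots(self):
        """Unused slots; they can only occur in the last block."""
        return len(self.block_table) * self.allocator.block_size - self.num_tokens


allocator = BlockAllocator(num_blocks=8, block_size=4)
a, b = Sequence(allocator), Sequence(allocator)
for _ in range(5):  # interleave two sequences, as a batched engine would
    a.append_token()
    b.append_token()

print(a.block_table, b.block_table, a.wasted_slots())
```

Because the two sequences grow in lockstep, their physical blocks interleave, yet each sequence still sees a contiguous range of logical blocks through its own table.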

KVCache Quant

In decoder-only models, caching the key and value (KV) matrices avoids recomputing them for tokens that have already been processed. At the same time, KV-cache-related computation in the decoder takes the form of GEMV (matrix-vector) operations, which are typically memory-bound. Storing the KV cache in a low-bit data type, together with weight-only quantization, therefore saves device memory and reduces the amount of data read and written by the kernels upstream and downstream of the KV cache, further improving model throughput.
For example, Colossal-Inference currently supports the FP8 (E5M2) KV cache format on NVIDIA GPUs. Taking an FP16 data flow as an example: when kernel1 writes to the KV cache, it casts its FP16 output to FP8 before storing it; when kernel2 reads the KV cache, it converts the FP8 data back to FP16 for computation. In this way, the device memory used by the KV cache is halved.
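
The write-cast and read-cast steps can be emulated on the CPU with Python's standard library. E5M2 has the same sign and exponent bits as FP16 and keeps only the top two mantissa bits, so the sketch below simply keeps the high byte of the FP16 encoding. This truncation (round toward zero) is a simplification chosen for clarity; the rounding used by the actual kernels is not described here, and the function names are hypothetical.

```python
import struct

def fp16_to_e5m2(value):
    """Encode as FP16, then keep the high byte (1 sign, 5 exponent, 2 mantissa bits)."""
    (bits,) = struct.unpack("<H", struct.pack("<e", value))
    return bits >> 8  # drops the 8 low mantissa bits (truncation)

def e5m2_to_fp16(byte):
    """Decode by zero-filling the dropped mantissa bits."""
    (value,) = struct.unpack("<e", struct.pack("<H", byte << 8))
    return value

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """KV cache footprint of one token; the factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

for v in (1.75, -2.5, 0.3):
    print(v, "->", e5m2_to_fp16(fp16_to_e5m2(v)))

# Assumed model shape: 32 layers, 8 KV heads, head dim 128.
print(kv_bytes_per_token(32, 8, 128, 2), kv_bytes_per_token(32, 8, 128, 1))
```

Values that need no more than two mantissa bits survive the round trip exactly, while others lose precision, which is the trade-off accepted in exchange for halving the cache size.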

Speculative Decoding

The Colossal-Inference engine supports speculative decoding. Users can optionally enable this component by providing a small model as the drafter, which speculates several tokens ahead, while the optimized large model acts as the verifier and checks those tokens in parallel. In addition, the speculative decoding component supports using a GliDe model as the drafter, which reuses the KV cache of the large (verifier) model in its cross-attention [2]. On the GSM8K and MT-Bench benchmarks, the speedup reached 1.5x.
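
The draft-then-verify loop can be illustrated with a toy greedy variant. This is a sketch under simplifying assumptions, not the engine's implementation: both "models" are plain functions that return the next token for a context, acceptance is by exact greedy match, and the verifier is called in a loop where a real engine would score all drafted positions in one forward pass.

```python
# Toy greedy speculative decoding. All names are illustrative.

def speculative_decode(drafter, verifier, prompt, max_new_tokens, k=4):
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # 1) The drafter proposes k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(drafter(tokens + draft))
        # 2) The verifier's own choice at each drafted position, plus one more.
        expected = [verifier(tokens + draft[:i]) for i in range(k + 1)]
        # 3) Accept the longest prefix on which both models agree.
        n_accept = 0
        while n_accept < k and draft[n_accept] == expected[n_accept]:
            n_accept += 1
        # 4) Keep the accepted tokens plus one verifier token
        #    (a correction on mismatch, or a bonus token if all matched).
        tokens += draft[:n_accept] + [expected[n_accept]]
    return tokens[:target_len]

def verifier(ctx):
    """Toy 'large model': counts upward modulo 10."""
    return (ctx[-1] + 1) % 10

def drafter(ctx):
    """Toy 'small model': agrees with the verifier except after token 2."""
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10

result = speculative_decode(drafter, verifier, prompt=[0], max_new_tokens=8)
print(result)
```

With greedy acceptance the output is identical to what the verifier would generate alone; the gain comes from producing several tokens per verifier pass whenever the drafter guesses correctly.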


In the decoding stage, the attention computation becomes a memory-bound GEMV operation because of the KV cache. Drawing on vLLM and FasterTransformer, we implemented a high-performance PagedAttention operator. It uses a specific key cache layout ([num_blocks, num_kv_heads, headsize/x, block_size, x]) to improve shared-memory write efficiency during attention computation, and combines it with a matching thread-mapping scheme to perform the reduction efficiently.
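
To show what this layout means for addressing, the sketch below computes the flat offset of a single key element, assuming a row-major tensor of shape [num_blocks, num_kv_heads, headsize/x, block_size, x]. The function name and the example sizes (including x = 8) are assumptions made for illustration only.

```python
# Flat offset of key element `dim` for the token at `slot` in `block`,
# assuming a row-major [num_blocks, num_kv_heads, head_size // x, block_size, x]
# tensor. Illustrative helper, not part of any library.

def key_cache_offset(block, head, dim, slot, *, num_kv_heads, head_size, block_size, x):
    assert head_size % x == 0, "head_size must be a multiple of x"
    groups = head_size // x  # number of x-wide chunks per head
    idx = (block * num_kv_heads + head) * groups + dim // x
    idx = idx * block_size + slot
    return idx * x + dim % x

shape = dict(num_kv_heads=2, head_size=16, block_size=4, x=8)

# The x consecutive head-dim elements of one token sit next to each other ...
print([key_cache_offset(0, 0, d, 0, **shape) for d in range(8)])
# ... and the same chunk of the next token in the block follows x elements later.
print(key_cache_offset(0, 0, 0, 1, **shape))
```

Under this layout, a chunk of x head-dimension elements for one token is contiguous, and the same chunk for neighbouring tokens in a block is adjacent, so chunks can be moved as small vectors.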

Continuous Batching

Existing large language models (LLMs) typically split the inference process into two stages: prefill and decoding. In the prefill stage, the input_ids of a sequence are used to compute and predict one token, and the corresponding key-value cache is stored. The decoding stage then uses this predicted token and the key-value cache for further computation. Traditional inference schemes often ignore this distinction and apply a uniform batching method, which can lead to inefficiencies such as idle arithmetic units. This issue is highlighted in the Orca paper [3].
To reduce idle resources, continuous batching was introduced. There are two mainstream approaches. The first is Orca's selective batching, which does not distinguish between prefill and decoding: in the attention phase, inputs are separated by length and executed batch by batch, while the other operators (such as the QKV projection) run on the aggregated inputs. The second is vLLM's approach, which separates prefill and decoding at batching time and pads inputs of different lengths in the prefill phase so that all operators can share the same batching logic.
Colossal-Inference combines the advantages of both approaches: it also separates the computation of the prefill and decoding stages, but without any additional padding. The no-padding operators reimplemented in Colossal-AI can directly handle one-dimensional inputs. However, because the sentences in a prefill batch vary in length, prefill is computationally intensive and takes a long time. A prefill-ratio is therefore specified, and a prefill step is performed only when the proportion of requests waiting to be prefilled reaches this threshold. This ensures that the fast decoding process runs most of the time.
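
The two ideas in this paragraph, padding-free one-dimensional inputs and a ratio-gated prefill step, can be sketched as follows. This is an illustration only: the class and function names are hypothetical, and the exact quantity the engine compares against the prefill-ratio is an assumption here (the number of waiting requests divided by the number of running ones).

```python
from collections import deque

def pack_without_padding(prompts):
    """Concatenate variable-length prompts into one 1-D token list.

    cu_seqlens[i]:cu_seqlens[i + 1] delimits prompt i, so no pad tokens are needed.
    """
    flat, cu_seqlens = [], [0]
    for prompt in prompts:
        flat.extend(prompt)
        cu_seqlens.append(len(flat))
    return flat, cu_seqlens


class ToyScheduler:
    """Chooses between a prefill step and a decode step on each iteration."""

    def __init__(self, prefill_ratio, max_batch_size):
        self.prefill_ratio = prefill_ratio
        self.max_batch_size = max_batch_size
        self.waiting = deque()
        self.running = []

    def add_request(self, prompt):
        self.waiting.append(prompt)

    def step(self):
        free = self.max_batch_size - len(self.running)
        # Assumed rule: prefill if nothing is running yet, or if enough
        # requests are queued relative to the running batch.
        ratio_reached = (
            not self.running
            or len(self.waiting) / len(self.running) >= self.prefill_ratio
        )
        if self.waiting and free > 0 and ratio_reached:
            batch = [self.waiting.popleft() for _ in range(min(free, len(self.waiting)))]
            self.running.extend(batch)
            return "prefill", batch
        return "decode", list(self.running)


sched = ToyScheduler(prefill_ratio=1.0, max_batch_size=4)
sched.add_request([1, 2, 3])
sched.add_request([4, 5])
first = sched.step()   # nothing running yet, so prefill both prompts
sched.add_request([6])
second = sched.step()  # 1 waiting / 2 running is below the ratio, so decode
sched.add_request([7, 8])
third = sched.step()   # 2 waiting / 2 running reaches the ratio, so prefill

print(first[0], second[0], third[0])
print(pack_without_padding(first[1]))
```

Gating prefill on a ratio keeps a single late arrival from interrupting an ongoing decode batch, while still admitting new requests once enough of them have accumulated.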

Online Serving

Colossal-Inference integrates an online inference service module built on FastAPI, supporting both single-sentence completion and multi-round conversation (chat) interfaces. With the api_server module provided by Colossal-Inference, users can conveniently start a local inference service and use the Colossal-Inference backend to execute inference tasks efficiently.

[1] Kwon, Woosuk, et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention." arXiv preprint arXiv:2309.06180 (2023).
[2] Du, Cunxiao, et al. "GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding." arXiv preprint arXiv:2402.02082 (2024).
[3] Yu, Gyeong-In, et al. "Orca: A Distributed Serving System for Transformer-Based Generative Models." 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 2022.