Best LLM Inference Engine? TensorRT vs vLLM vs LMDeploy vs MLC-LLM
Zain ul Abideen
July 7, 2024
15 min read
Benchmarking various LLM Inference Engines.
LLMs excel at text generation applications, such as chat and code completion models capable of a high level of understanding and fluency. However, their large size also creates challenges for inference. Basic inference is slow because LLMs generate text token by token, requiring a new forward pass for each next token, so as the input sequence grows, the processing time increases. Additionally, LLMs have billions of parameters, making it difficult to store and manage all those weights in memory. Multiple frameworks and packages aim to optimize LLM inference and serving; in this blog, I'll use and compare the following inference engines: TensorRT-LLM, vLLM, LMDeploy, and MLC-LLM.
1. TensorRT-LLM
Introduction
TensorRT-LLM is NVIDIA's inference engine that accelerates and optimizes inference performance for the latest LLMs on NVIDIA GPUs. Models are compiled into a TensorRT engine and then deployed with Triton Inference Server to leverage inference optimizations such as in-flight batching (which reduces wait time and allows higher GPU utilization), paged KV caching, multi-GPU/multi-node inference, and FP8 support.
Usage
We will compare the execution time, ROUGE scores, latency, and throughput across the base HF model, the TensorRT-LLM engine, and the INT8-quantized TensorRT-LLM engine.
You need to install the NVIDIA Container Toolkit for your Linux system, initialize Git LFS (to download HF models), and install the necessary packages as follows.
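The exact setup depends on your machine; here is a minimal notebook-style sketch (package names and the NVIDIA wheel index are assumptions based on the TensorRT-LLM install docs and may differ for your environment):
# Install Git LFS (for pulling large HF model files) and OpenMPI (required by tensorrt_llm)
!apt-get update && apt-get -y install git git-lfs libopenmpi-dev
!git lfs install
# Clone TensorRT-LLM for its example scripts and install the Python package
!git clone https://github.com/NVIDIA/TensorRT-LLM.git
!pip install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com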
Now retrieve the model weights:
PHI_PATH="TensorRT-LLM/examples/phi"
!rm -rf $PHI_PATH/7B
!mkdir -p $PHI_PATH/7B && git clone https://huggingface.co/microsoft/Phi-3-small-128k-instruct $PHI_PATH/7B
Convert the model into the TensorRT-LLM checkpoint format and then build the TensorRT-LLM engine from that checkpoint.
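A sketch of the two steps, following the phi example scripts shipped with TensorRT-LLM (script paths, output directories, and flag names are assumptions and may differ between releases):
# Convert the HF checkpoint into TensorRT-LLM checkpoint format (FP16)
!python3 $PHI_PATH/convert_checkpoint.py --model_dir $PHI_PATH/7B/ --output_dir $PHI_PATH/phi_ckpt_fp16 --dtype float16
# Build the TensorRT-LLM engine from the converted checkpoint
!trtllm-build --checkpoint_dir $PHI_PATH/phi_ckpt_fp16 --output_dir $PHI_PATH/phi_engine_fp16 --gemm_plugin float16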
Similarly, apply INT8 weight-only quantization to the HF model and build a second TensorRT-LLM engine from the quantized checkpoint.
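Again a sketch; the weight-only quantization flags track the TensorRT-LLM example scripts and may change between versions:
# INT8 weight-only quantization applied during checkpoint conversion
!python3 $PHI_PATH/convert_checkpoint.py --model_dir $PHI_PATH/7B/ --output_dir $PHI_PATH/phi_ckpt_int8 --dtype float16 --use_weight_only --weight_only_precision int8
# Build the INT8 engine
!trtllm-build --checkpoint_dir $PHI_PATH/phi_ckpt_int8 --output_dir $PHI_PATH/phi_engine_int8 --gemm_plugin float16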
Now test the base Phi-3 model and the two TensorRT-LLM engines on the summarization task.
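TensorRT-LLM ships a summarization example (CNN/DailyMail with ROUGE scoring); the invocations below are a sketch, with the script path and flags assumed from the repo layout at the time of writing:
# Baseline HF model
!python3 $PHI_PATH/../summarize.py --test_hf --hf_model_dir $PHI_PATH/7B/ --data_type fp16
# FP16 TensorRT-LLM engine
!python3 $PHI_PATH/../summarize.py --test_trt_llm --hf_model_dir $PHI_PATH/7B/ --engine_dir $PHI_PATH/phi_engine_fp16
# INT8 TensorRT-LLM engine
!python3 $PHI_PATH/../summarize.py --test_trt_llm --hf_model_dir $PHI_PATH/7B/ --engine_dir $PHI_PATH/phi_engine_int8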
After capturing the results, you can parse the output and plot it to compare execution time, ROUGE scores, latency, and throughput across all three models.
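For example, once the numbers are parsed from the logs, a simple bar chart does the job (the values below are placeholders, not measured results):
import matplotlib.pyplot as plt

# Placeholder values -- replace with the numbers parsed from your own runs
models = ["HF Phi-3", "TensorRT FP16", "TensorRT INT8"]
latency_s = [9.5, 4.1, 3.2]
throughput_tok_s = [210, 480, 610]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.bar(models, latency_s)
ax1.set_ylabel("Latency (s)")
ax2.bar(models, throughput_tok_s)
ax2.set_ylabel("Throughput (tokens/s)")
plt.tight_layout()
plt.show()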
Comparison of Latency and Throughput
2. vLLM
Introduction
vLLM offers LLM inference and serving with state-of-the-art throughput, PagedAttention, continuous batching, quantization (GPTQ, AWQ, FP8), and optimized CUDA kernels.
Usage
Let's evaluate the throughput and latency of microsoft/Phi-3-mini-4k-instruct. Start by setting up dependencies and importing libraries.
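A minimal setup sketch (the datasets package is only there to pull a few test prompts and is an illustrative choice):
!pip install vllm datasets

from vllm import LLM, SamplingParams
from datasets import load_dataset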
Now let's load the model and generate its outputs on a small slice of the dataset.
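A sketch of offline generation with vLLM's LLM class; the dataset and prompt format here are illustrative assumptions, not the only way to do it:
# Load Phi-3-mini with vLLM (trust_remote_code is needed for the Phi-3 modeling code on the Hub)
llm = LLM(model="microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# Take a small slice of CNN/DailyMail articles as prompts
dataset = load_dataset("cnn_dailymail", "3.0.0", split="test[:10]")
prompts = ["Summarize the following article:\n" + article for article in dataset["article"]]

outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out.outputs[0].text[:200])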
Let's also benchmark the model's performance through vLLM on the ShareGPT dataset.
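vLLM ships its own throughput benchmark script; the commands below are a sketch, and the script path and flags track the vLLM repo at the time of writing:
# Grab the ShareGPT dataset commonly used for serving benchmarks
!wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
# Run the benchmark script from the vLLM repository
!git clone https://github.com/vllm-project/vllm.git
!python vllm/benchmarks/benchmark_throughput.py --backend vllm --model microsoft/Phi-3-mini-4k-instruct --dataset ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --trust-remote-code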
3. LMDeploy
Introduction
LMDeploy is a toolkit for compressing, deploying, and serving LLMs. It offers efficient inference (persistent batching, blocked KV cache, dynamic split & fuse, tensor parallelism, and high-performance CUDA kernels), effective quantization (4-bit inference performance up to 2.4x higher than FP16), effortless distributed serving (deployment of multi-model services across multiple machines and GPUs), and an interactive inference mode (which remembers dialogue history and avoids repetitive processing of historical sessions). Furthermore, it lets you profile token latency and throughput, request throughput, the API server, and Triton Inference Server performance.
Usage
Install dependencies and import packages.
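A quick setup sketch; we import the pipeline API and the PyTorch engine config used in the next step:
!pip install lmdeploy

from lmdeploy import pipeline, PytorchEngineConfig, GenerationConfig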
Let's profile the PyTorch engine on microsoft/Phi-3-mini-128k-instruct.
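LMDeploy ships dedicated profiling scripts in its repo's benchmark/ folder; the hand-rolled timing below is just an illustrative sketch of the pipeline API with the PyTorch backend (the session length and prompt batch are arbitrary choices):
import time

# Build a pipeline backed by LMDeploy's PyTorch engine
pipe = pipeline("microsoft/Phi-3-mini-128k-instruct", backend_config=PytorchEngineConfig(session_len=2048))
gen_config = GenerationConfig(max_new_tokens=256)

prompts = ["Explain KV caching in one paragraph."] * 8
start = time.time()
responses = pipe(prompts, gen_config=gen_config)
elapsed = time.time() - start

# generate_token_len is the number of generated tokens per response
total_tokens = sum(r.generate_token_len for r in responses)
print(f"{total_tokens / elapsed:.1f} tokens/s over {len(prompts)} prompts in {elapsed:.1f} s")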
PyTorch engine profile for token latency and throughput
4. MLC-LLM
Introduction
MLC-LLM offers a high-performance deployment and inference engine, called MLCEngine.
Usage
Let's install the dependencies, which involves setting up conda, creating a conda environment, and installing the MLC packages. Then clone the git repository and configure the build.
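A sketch of the environment setup, run from a terminal; the nightly wheel names (cu121 builds) are assumptions and should be matched to your CUDA version:
# Create an isolated environment with Python and Git LFS
conda create -n mlc-env -c conda-forge python=3.11 git git-lfs -y
conda activate mlc-env
# Install the prebuilt MLC wheels and clone the repo
pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu121 mlc-ai-nightly-cu121
git clone https://github.com/mlc-ai/mlc-llm.git --recursive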
To run a model with MLC LLM, we need to convert model weights into MLC format.
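A sketch of the conversion using the mlc_llm CLI, keeping Phi-3-mini as the running example; the q4f16_1 quantization and the phi-3 conversation template are assumptions, so check the MLC-LLM docs for your model:
# Fetch the HF weights
git lfs install
git clone https://huggingface.co/microsoft/Phi-3-mini-128k-instruct dist/models/Phi-3-mini-128k-instruct
# Convert weights into MLC format with 4-bit quantization
mlc_llm convert_weight dist/models/Phi-3-mini-128k-instruct --quantization q4f16_1 -o dist/Phi-3-mini-128k-instruct-q4f16_1-MLC
# Generate the chat config (conversation template name is an assumption)
mlc_llm gen_config dist/models/Phi-3-mini-128k-instruct --quantization q4f16_1 --conv-template phi-3 -o dist/Phi-3-mini-128k-instruct-q4f16_1-MLC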
Now load your MLC-format model into the MLCEngine and run a chat completion.
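A minimal sketch following the MLCEngine quick-start pattern, where the model path is the output directory produced by the conversion step above:
from mlc_llm import MLCEngine

model = "dist/Phi-3-mini-128k-instruct-q4f16_1-MLC"
engine = MLCEngine(model)

# OpenAI-style streaming chat completion
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "Give me a two-sentence summary of paged KV caching."}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)

engine.terminate()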
Summary
In terms of inference speed, the TensorRT-LLM INT8 engine outperformed both the HF model and the FP16 TensorRT-LLM engine, while the FP16 engine performed best on the summarization task with the highest ROUGE score among the three. LMDeploy delivers up to 1.8x higher request throughput than vLLM on an A100.
AI, Machine Learning, Deep Learning, LLM, Inference Engine, TensorRT, vLLM, LMDeploy, MLC-LLM