Research
Multi-GPU Training of 70B LLM with Deepspeed and FSDP+Qlora
Zain ul Abideen
March 14, 2024
12 min read
Train 70–120B LLM on 4xA100s and 2xRTX3090s (Consumer-grade GPUs)
I have been working with bigger models like Mixtral 8x7B, Qwen-120B, and Miqu-70B recently. The most important consideration when playing with bigger models is the amount of compute they require during training. I have been using Deepspeed for multi-GPU training and learning what difference each stage (Zero-1, Zero-2, Zero-3) brings to the table. I will also be focusing on a recent technique (FSDP+Qlora) for training larger models on consumer-grade GPUs. A few details regarding my recent experiments:
Liberated Miqu 70B
With the release of the new SystemChat dataset from Abacus AI, I tried fine-tuning Miqu-70B on it with 2x A100s and Deepspeed Zero-2. I also tried Deepspeed Zero-3, but after running into multiple issues in Axolotl around quantization and OOM, I went back to Zero-2. The key difference is that Zero-2 shards only the optimizer states and gradients across GPUs while keeping a full copy of the model parameters on each GPU, whereas Zero-3 shards the model weights across all GPUs as well. Liberated Miqu 70B is a totally uncensored model, so be careful what you use it for. I trained the model for 1 epoch using Qlora with Axolotl. The Axolotl configuration for this experiment is shown below.
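A minimal sketch of such a Zero-2 QLoRA setup in Axolotl's YAML schema (the base model path, dataset format, and hyperparameters here are illustrative assumptions, not necessarily the exact values used for this run):

```yaml
# Sketch of an Axolotl config: QLoRA fine-tuning with Deepspeed Zero-2.
# Model path, dataset format, and hyperparameters are assumptions.
base_model: 152334H/miqu-1-70b-sf   # assumed dequantized Miqu-70B checkpoint
load_in_4bit: true                  # QLoRA: 4-bit base weights
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true            # apply LoRA to all linear layers

datasets:
  - path: abacusai/SystemChat       # Abacus AI SystemChat dataset
    type: sharegpt                  # assumed conversation format

sequence_len: 4096
sample_packing: true

micro_batch_size: 1
gradient_accumulation_steps: 4
num_epochs: 1
learning_rate: 0.0002
lr_scheduler: cosine
optimizer: paged_adamw_8bit

bf16: true
gradient_checkpointing: true
flash_attention: true

# Zero-2: shard optimizer states and gradients, keep full params on each GPU
deepspeed: deepspeed_configs/zero2.json
output_dir: ./liberated-miqu-70b
```

A run like this is typically launched with accelerate launch -m axolotl.cli.train config.yml; with the zero2.json Deepspeed config, each A100 keeps its own 4-bit copy of the weights while the optimizer states and gradients are sharded across the two GPUs.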
Liberated-Miqu-70B: https://huggingface.co/abideen/Liberated-Miqu-70B
FSDP+Qlora
Answer.ai released a new technique to train bigger models on consumer-grade GPUs (RTX 3090 or 4090) with FSDP and Qlora. Two types of hardware are normally used for training: data-center-class hardware such as H100s and A100s, and desktop machines with gaming GPUs such as dual 3090s or 4090s. The idea here was simple: figure out how to use these 10x cheaper GPUs to train the best available open-source models. This is where Answer.ai's FSDP+Qlora comes in handy. I gave FSDP+Qlora a shot with Mixtral 8x7B on 2x 3090s. The technique has also been integrated into the Axolotl library on an experimental basis. In Answer.ai's blog, they did not mention anything regarding speed and time constraints on consumer-grade GPUs. I set out to train Mixtral for only 100 steps to try things out, but the time required was 70 hours, which is huge. Since the experiment was taking such a long time, it was not feasible for me to complete it, so for now I am moving back to A100s until this technique becomes more efficient. By the way, it is a great effort by Jeremy Howard and his team to bring the training of larger models to consumer-grade GPUs with limited VRAM. The Axolotl config file for this experiment is given below.
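A sketch of what such an FSDP+Qlora run looks like in Axolotl's config schema (field names follow Axolotl's experimental FSDP examples; the dataset and hyperparameters are placeholders, not the exact values from this Mixtral run):

```yaml
# Sketch of an Axolotl FSDP+QLoRA config for Mixtral 8x7B on 2x RTX 3090s.
# Dataset and hyperparameters are placeholders, not the values from this run.
base_model: mistralai/Mixtral-8x7B-v0.1
load_in_4bit: true
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true

datasets:
  - path: mhenrichsen/alpaca_2k_test   # placeholder dataset
    type: alpaca

sequence_len: 2048
micro_batch_size: 1
gradient_accumulation_steps: 4
max_steps: 100                         # short trial run
learning_rate: 0.0002
optimizer: adamw_torch
bf16: true
gradient_checkpointing: true

# FSDP shards the quantized model, gradients, and optimizer state across GPUs
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: true            # offload to CPU RAM to fit in 24GB VRAM
  fsdp_cpu_ram_efficient_loading: true
  fsdp_use_orig_params: false
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: MixtralDecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
output_dir: ./mixtral-fsdp-qlora
```

Offloading parameters to CPU RAM is part of what lets a 24GB 3090 hold its shard of Mixtral, but the extra host-to-device traffic is likely also part of why the run was so slow.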
MegaQwen-120B
I also tried out the interleaving technique on Qwen-70B to create MegaQwen-120B, inspired by Venus-120B. Since a 120B model requires an insane amount of VRAM for training, I learned the hard way that you have to fine-tune your 70B model first and then interleave it, thereby bypassing the memory constraints. I did the opposite: I interleaved first and then tried to fine-tune the massive 120B model, which ended in OOM. My reasoning was that a 120B-parameter model needs about 240GB of VRAM in 16-bit (roughly 68GB in 4-bit), so throwing 4x A100s, i.e. 320GB of VRAM, at it should work. That didn't work out. The main reason is that Zero-2 keeps a full copy of the model parameters on each GPU, so the per-GPU footprint does not shrink as you add GPUs, and with PyTorch somehow taking another 12GB, a single 80GB A100 ran out of memory; throwing in more A100s therefore made no difference. Zero-3 (model parameter sharding) was not an option either, due to the errors it kept presenting. I have noted these OOM errors and will keep track of memory constraints more vigilantly in the future. The Axolotl config for this experiment is available below.
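As for the interleaving step itself, this kind of layer stacking (the recipe behind Venus-120B and similar frankenmerges) is typically done with mergekit's passthrough merge. A minimal sketch, where the base model name and layer ranges are illustrative assumptions rather than the exact MegaQwen-120B recipe:

```yaml
# Sketch of a mergekit "passthrough" interleave: overlapping layer ranges of a
# single ~70B base model are stacked to produce a ~120B model.
# Base model name and layer ranges are illustrative assumptions.
slices:
  - sources:
      - model: Qwen/Qwen-72B
        layer_range: [0, 20]
  - sources:
      - model: Qwen/Qwen-72B
        layer_range: [10, 30]
  - sources:
      - model: Qwen/Qwen-72B
        layer_range: [20, 40]
  # ...continue the overlapping pattern up to the final layer range...
  - sources:
      - model: Qwen/Qwen-72B
        layer_range: [60, 80]
merge_method: passthrough
dtype: float16
```

The merge only rearranges existing weights and can run in CPU RAM, so the expensive part is the fine-tuning; fine-tuning the 70B base first and interleaving afterwards is what sidesteps the 120B memory problem.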
MegaQwen-120B: https://huggingface.co/abideen/MegaQwen-120B
Conclusion
All in all, it was a good experience for me to try out multi-GPU training on the best available open-source models. I tried to work my way through the different errors, but some remained unresolved. I will try to solve them in future experiments.
AI, LLM, NLP, Machine Learning, Deep Learning, Queryloop