Multi-GPU Training of 70B LLM with Deepspeed and FSDP+Qlora

Zain ul Abideen
March 14, 2024
12 min read

Train 70–120B LLMs on 4x A100s and 2x RTX 3090s (consumer-grade GPUs)

I have been working with bigger models like Mixtral 8x7B, Qwen-120B, and Miqu-70B recently. The most important consideration when playing with bigger models is the amount of compute they require during training. I have been using Deepspeed for multi-GPU training, looking at what difference each stage (Zero-1, Zero-2, Zero-3) brings to the table. I will also be focusing on a recent technique (FSDP+Qlora) for training larger models on consumer-grade GPUs. A few details regarding my recent experiments:

Liberated Miqu 70B

With the release of the new SystemChat dataset from Abacus AI, I tried fine-tuning Miqu-70B on SystemChat with 2x A100s and Deepspeed Zero-2. I also tried Deepspeed Zero-3, but after running into multiple issues in Axolotl around quantization and OOM, I went back to Zero-2. The key difference is that Zero-2 only partitions the optimizer states and gradients across GPUs, keeping a full copy of the model parameters on each GPU, while Zero-3 shards the model weights across all GPUs as well. Liberated Miqu 70B is a totally uncensored model, so be careful what you use it for. I trained the model for 1 epoch using Qlora with Axolotl. The Axolotl configuration for this experiment is shown below.
base_model: 152334H/miqu-1-70b-sf
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: abacusai/SystemChat
    type: sharegpt
dataset_prepared_path:
val_set_size: 0
output_dir: /workspace/miqu-systemchat
resume_from_checkpoint:
hf_use_auth_token:
adapter: qlora
lora_model_dir:
sequence_len: 2048
sample_packing: true
pad_to_sequence_len: true
lora_r: 16
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:
lora_modules_to_save:
  - embed_tokens
  - lm_head
wandb_project: Miqu-Systemchat-multiGPU
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:
gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 1
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs:
group_by_length: false
bf16: true
fp16: false
tf32: false
gradient_checkpointing: true
early_stopping_patience:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
warmup_steps: 100
eval_steps:
save_steps: 2000
save_total_limit: 2
eval_sample_packing:
debug:
deepspeed: deepspeed_configs/zero2.json
weight_decay: 0.05
fsdp:
fsdp_config:
special_tokens:
tokens:
trust_remote_code: true
Liberated-Miqu-70B: https://huggingface.co/abideen/Liberated-Miqu-70B
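
To make the Zero-2 vs. Zero-3 trade-off concrete, here is a rough sketch of the per-GPU footprint of the frozen 4-bit base weights in a Qlora run. The helper below and its numbers are illustrative assumptions only (activations, the CUDA context, and the small Lora optimizer states are ignored), not measurements from the run.

def base_weights_per_gpu_gb(n_params_billion: float, n_gpus: int, zero_stage: int) -> float:
    # 4-bit weights take roughly 0.5 bytes per parameter.
    full_copy_gb = n_params_billion * 0.5
    # Zero-2 replicates the model weights on every GPU; Zero-3 shards them as well.
    return full_copy_gb if zero_stage == 2 else full_copy_gb / n_gpus

# Miqu-70B on 2x A100s (80 GB each):
print(base_weights_per_gpu_gb(70, n_gpus=2, zero_stage=2))  # ~35 GB held on each GPU
print(base_weights_per_gpu_gb(70, n_gpus=2, zero_stage=3))  # ~17.5 GB per GPU

Note that in a Qlora run the gradients and optimizer states that Zero-2 does shard belong only to the small Lora adapters, so the replicated base weights dominate the per-GPU budget.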

FSDP+Qlora

Answer.ai released a new technique to train bigger models on consumer-grade GPUs (RTX 3090 or 4090) with FSDP and Qlora. Two types of hardware are normally used: data-center-class hardware such as H100s and A100s, and desktop machines with gaming GPUs such as dual 4090s or 3090s. The idea is simple: figure out how to use these 10x cheaper GPUs to train the best available open-source models. This is where Answer.ai's FSDP+Qlora comes in handy. I gave FSDP+Qlora a shot with Mixtral 8x7B on 2x 3090s. The technique has also been integrated into the Axolotl library on an experimental basis. Answer.ai's blog did not mention anything about speed or time constraints on consumer-grade GPUs. I set out to train Mixtral for only 100 steps to try things out, but the time required for that was around 70 hours, which is huge. Since the experiment was taking such a long time, it was not feasible for me to complete it, so for now I am moving back to A100s until this technique becomes more efficient. Still, it is a great effort by Jeremy Howard and his team to bring the training of larger models to consumer-grade GPUs with limited VRAM. The Axolotl config file for this experiment is given below.
base_model: mistralai/Mixtral-8x7B-v0.1
model_type: AutoModelForCausalLM
tokenizer_type: LlamaTokenizer
trust_remote_code: true

load_in_8bit: false
load_in_4bit: true
strict: false
datasets:
  - path: cognitivecomputations/WizardLM_evol_instruct_V2_196k_unfiltered_merged_split
    type: sharegpt
    conversation: chatml
dataset_prepared_path: last_run_prepared
val_set_size: 0.02
output_dir: ./qlora-out
model_config:
  output_router_logits: true
adapter: qlora
lora_model_dir:
sequence_len: 1024
sample_packing: false
pad_to_sequence_len: false
lora_r: 16
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
wandb_project: fsdp
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 1
max_steps: 100
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
loss_watchdog_threshold: 5.0
loss_watchdog_patience: 3
warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128
saves_per_epoch: 1
debug:
weight_decay: 0.0
fsdp:
  - full_shard
fsdp_config:
  fsdp_transformer_layer_cls_to_wrap: MixtralSparseMoeBlock
special_tokens:
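
For context on what makes FSDP+Qlora possible at all: FSDP wants to flat-shard parameters of a uniform dtype, while bitsandbytes normally keeps the packed 4-bit weights in an integer buffer that FSDP cannot shard. The sketch below shows the model-loading side of that idea via the bnb_4bit_quant_storage option available in recent transformers/bitsandbytes releases; it is only an illustration of the mechanism, since Axolotl's experimental integration drives all of this from the YAML above.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 quantization; bnb_4bit_quant_storage keeps the packed 4-bit weights in a
# bf16 container so FSDP can flat-shard them like any other bf16 parameter.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,  # non-quantized modules (norms, router) stay in bf16
)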

MegaQwen-120B

I also tried out the interleaving technique on Qwen-70B to create MegaQwen-120B, inspired by Venus-120B. Since a 120B model requires an insane amount of VRAM for training, I learned the hard way that you have to fine-tune the 70B model first and then interleave it, thereby bypassing the memory constraints. Instead, I interleaved first and then tried to fine-tune the massive 120B model, which ended in OOM. My reasoning was that a 120B-parameter model needs about 240GB of VRAM in 16-bit, or roughly 68GB in 4-bit, so throwing 4x A100s (320GB of VRAM in total) at it should work. It didn't. The main reason is that Zero-2 keeps a full copy of the model parameters on every GPU, and PyTorch itself was taking around 12GB, leading to OOM on an 80GB A100, so adding more A100s made no difference. Zero-3 (which also shards the model parameters) was not an option because of the errors it kept presenting. I noted these OOM errors and will keep a closer eye on memory budgets in future runs. A back-of-the-envelope check of the memory math is sketched below, followed by the Axolotl config for this experiment.
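
As a quick sanity check on those numbers, here is the arithmetic spelled out. This is purely illustrative and only restates the figures quoted above; the 12GB framework overhead is simply what I observed in the run.

# Per-GPU budget under Zero-2, where every GPU holds a full copy of the model
# weights, so adding more A100s does not shrink this footprint.
weights_4bit_gb = 68   # ~120B parameters in 4-bit, as estimated above
framework_gb = 12      # PyTorch/CUDA overhead observed during the run
a100_gb = 80
headroom_gb = a100_gb - weights_4bit_gb - framework_gb
print(headroom_gb)     # ~0 GB left for activations, Lora adapters, and optimizer states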
base_model: abideen/Qwen-120B
model_type: Qwen2ForCausalLM
tokenizer_type: Qwen2Tokenizer
load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: abacusai/SystemChat
    type: sharegpt
dataset_prepared_path:
val_set_size: 0
output_dir: /workspace/Qwen-120b-systemchat
resume_from_checkpoint:
hf_use_auth_token:
adapter: qlora
lora_model_dir:
sequence_len: 2048
sample_packing: true
pad_to_sequence_len: true
lora_r: 16
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:
lora_modules_to_save:
  - embed_tokens
  - lm_head
wandb_project: Qwen-Systemchat-multiGPU
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:
gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 1
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs:
group_by_length: false
bf16: true
fp16: false
tf32: false
gradient_checkpointing: true
early_stopping_patience:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
warmup_steps: 100
eval_steps:
save_steps: 2000
save_total_limit: 2
eval_sample_packing:
debug:
deepspeed:
weight_decay: 0.05
fsdp:
fsdp_config:
special_tokens:
  eos_token: "<|im_end|>"
tokens:
  - "<|im_start|>"
trust_remote_code: true
MegaQwen-120B: https://huggingface.co/abideen/MegaQwen-120B

Conclusion

All in all, it was a good experience to try out multi-GPU training on the best available open-source models. I worked my way through different errors, but some remained unresolved; I will try to solve them in future experiments.