
Align Phi3 with CPO-SimPO

Zain ul Abideen
July 7, 2024

Align your LLM with an approach that is more memory- and speed-efficient than DPO

Aligning LLMs for optimal performance typically starts with Supervised Fine-Tuning (SFT). The standard practice is to load the model in 4-bit and apply a LoRA (Low-Rank Adaptation) configuration for training. Direct Preference Optimization (DPO) is another prominent technique for optimizing models at lower cost; it is commonly coupled with SFT (SFT+DPO) to further improve model performance, but the two-stage pipeline can be expensive. Odds Ratio Preference Optimization (ORPO) collapses SFT+DPO into a single step with improved performance by adding an odds ratio-based penalty to the conventional negative log-likelihood (NLL) loss, which differentiates the generation styles of favored and disfavored responses. Another technique for more stable training and improved performance is CPO-SimPO. It aims to counter SFT's dependence on training-data quality, DPO's memory and speed inefficiency (since it keeps both the parametrized policy and a reference policy), and the tendency to generate long but low-quality sequences. In this blog, I will introduce this technique in detail and then train Phi3-Mini-4K-Instruct with CPO-SimPO.

What is CPO-SimPO?

It is a combination of two preference optimization methods: CPO and SimPO.

Contrastive Preference Optimization — CPO

Introduced by Haoran Xu et al., 2024, the CPO objective approximates the DPO objective by discarding the reference (ideal) policy from the original DPO loss. A behavior cloning (BC) regularizer is also incorporated to ensure the model does not deviate from the preferred data distribution.
CPO trains on a high-quality, though not necessarily flawless, preference dataset (format: prompt, chosen, rejected), pushing the model toward near-perfect output and teaching it to avoid even the minor flaws present in the preferred responses.
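
For reference, here is my sketch of the CPO loss as I read the paper, where y_w and y_l are the chosen and rejected responses, σ is the sigmoid, and β is a scaling hyperparameter; the second term is the behavior cloning (NLL) regularizer:

\mathcal{L}_{\mathrm{CPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\big[\log \sigma\big(\beta \log \pi_\theta(y_w \mid x) - \beta \log \pi_\theta(y_l \mid x)\big)\big] \;-\; \mathbb{E}_{(x,\,y_w)\sim\mathcal{D}}\big[\log \pi_\theta(y_w \mid x)\big]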

Simple Preference Optimization — SimPO

Introduced by Yu Meng et al., 2024, SimPO eliminates the need for a reference model, in contrast to regular DPO. Instead of DPO's implicit reward, which relies on a reference policy, it uses a length-normalized reward: the average log probability of the tokens in a response under the policy model itself. Secondly, it introduces a target reward margin γ to ensure that the reward difference between the chosen and rejected responses exceeds this margin.
Because it keeps no reference model in memory, SimPO is more memory- and compute-efficient than DPO, and its length normalization prevents the generation of longer but lower-quality sequences; it outperforms DPO on the AlpacaEval 2 and Arena-Hard benchmarks.
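
Written out, my sketch of the SimPO objective uses the length-normalized reward (β/|y|)·log π_θ(y|x), where |y| is the response length in tokens, together with the margin γ:

\mathcal{L}_{\mathrm{SimPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\Big[\log \sigma\Big(\tfrac{\beta}{|y_w|}\log \pi_\theta(y_w \mid x) - \tfrac{\beta}{|y_l|}\log \pi_\theta(y_l \mid x) - \gamma\Big)\Big]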

CPO-SimPO

Combining the two objectives gives the CPO-SimPO loss, which enjoys the benefits of both preference optimization methods jointly.

CPO-SimPO — Objective
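
As I understand the combination implemented in the CPO_SIMPO repository, the objective is the SimPO loss plus CPO's NLL (behavior cloning) regularizer, weighted by a coefficient α that corresponds to the `cpo_alpha` hyperparameter used later in the training config:

\mathcal{L}_{\mathrm{CPO\text{-}SimPO}} = \mathcal{L}_{\mathrm{SimPO}} \;-\; \alpha\,\mathbb{E}_{(x,\,y_w)\sim\mathcal{D}}\big[\log \pi_\theta(y_w \mid x)\big]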

CPO-SimPO Training of Phi3-Mini-4k-Instruct

We can perform CPO-SimPO training of any Hugging Face model using the official GitHub repository.

Setting up dependencies

We will need to create a Python environment using conda. If you don't have conda installed, follow these steps (you will need to open a new terminal afterwards for the changes to take effect), then create and activate an environment as shown after the install commands.
  • mkdir -p ~/miniconda3
  • wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
  • bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
  • rm -rf ~/miniconda3/miniconda.sh
  • ~/miniconda3/bin/conda init bash
  • ~/miniconda3/bin/conda init zsh
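
With conda on your PATH, create and activate a fresh environment. The environment name and Python version below are my own choices (I believe the alignment-handbook README suggests Python 3.10), so adjust them as needed:

conda create -n handbook python=3.10 -y
conda activate handbook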

Install pytorch and dependencies

You need to install PyTorch v2.2.2 and the other dependencies for the alignment-handbook repo.
conda install pytorch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 pytorch-cuda=11.8 -c pytorch -c nvidia

Clone and Install the required repositories

Install the alignment-handbook repo and its dependencies.
git clone https://github.com/huggingface/alignment-handbook.git
cd ./alignment-handbook/
python -m pip install .
cd ..

Flash Attention 2 Installation

You will also need Flash Attention 2 installed.
python -m pip install flash-attn --no-build-isolation

Clone the CPO_SimPO repository

Clone the CPO-SimPO repository to start the training.
git clone https://github.com/fe1ixxu/CPO_SIMPO.git
cd CPO_SIMPO

Training configurations

Create a .yaml config file to specify the training arguments. Adjust `per_device_train_batch_size` and `max_length` according to your GPU specifications.
Make sure to set `loss_type: simpo` and give `cpo_alpha` a non-zero value so that CPO's NLL regularizer stays active alongside the SimPO loss.
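
Below is an illustrative sketch of such a config, not the exact file from the repo: the field names follow the alignment-handbook/TRL CPO conventions, and the dataset and hyperparameter values are assumptions of mine, so check them against the example configs shipped with the CPO_SIMPO repository.

# training_configs/phi3-mini4k-instruct-cpo-simpo.yaml (illustrative sketch)
model_name_or_path: microsoft/Phi-3-mini-4k-instruct
torch_dtype: bfloat16

# Data (assumed preference dataset with prompt/chosen/rejected columns; swap in your own)
dataset_mixer:
  HuggingFaceH4/ultrafeedback_binarized: 1.0
dataset_splits:
  - train_prefs
  - test_prefs
preprocessing_num_workers: 12

# CPO-SimPO arguments
loss_type: simpo       # use the SimPO preference term
cpo_alpha: 0.05        # non-zero weight on the NLL/BC regularizer
simpo_gamma: 5.4       # target reward margin
beta: 10.0             # reward scaling

# Training arguments
bf16: true
learning_rate: 1.0e-6
num_train_epochs: 1
per_device_train_batch_size: 2   # adjust to your GPU memory
gradient_accumulation_steps: 8
max_length: 2048                 # adjust to your GPU memory
max_prompt_length: 1800
lr_scheduler_type: cosine
warmup_ratio: 0.1
logging_steps: 5
output_dir: data/phi3-mini-4k-instruct-cpo-simpo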

Accelerate Configuration

Next, specify the hardware configuration. Use the `deepspeed_zero3.yaml` config in the `accelerate_configs` directory.
Set `num_processes` to the number of GPUs you have available. A100 GPUs are recommended to avoid CUDA errors.
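
For orientation, the key fields of that file look roughly like the excerpt below (reproduced from memory, so verify against the copy in the repo); `num_processes: 4` is just an example value:

# accelerate_configs/deepspeed_zero3.yaml (abridged)
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 3
  zero3_init_flag: true
  zero3_save_16bit_model: true
  offload_optimizer_device: none
  offload_param_device: none
mixed_precision: bf16
num_machines: 1
num_processes: 4   # set this to the number of GPUs available
use_cpu: false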

Start Training

Once everything is set up, provide the paths to the training and accelerate config files and start training.
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml scripts/run_cpo.py training_configs/phi3-mini4k-instruct-cpo-simpo.yaml

Inference

After training, you can perform inference using the following code:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load the CPO-SimPO-aligned model on the GPU
model = AutoModelForCausalLM.from_pretrained(
    "abideen/Phi-3-mini-4K-instruct-cpo-simpo",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("abideen/Phi-3-mini-4K-instruct-cpo-simpo")

# Build a text-generation pipeline around the model and tokenizer
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

# Run a chat-style prompt and print the generated response
output = pipe(
    [{"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"}],
    max_new_tokens=500,
)
print(output[0]["generated_text"])