ORPO Outperforms SFT+DPO | Train Phi-2 with ORPO
Zain ul Abideen
March 22, 2024
5 min read
Train Phi-2 with ORPO using LazyORPO
Introduction
Before jumping into ORPO, I am going to assume that you are well-acquainted with the process of fine-tuning LLMs for optimal performance. One of the most common techniques used for fine-tuning is Supervised Fine-Tuning (SFT). The most common way of doing SFT is to load the model in 4-bit and apply a LoRA config to the model for training. Then we use TRL's SFTTrainer to fine-tune the model. That's one way of reaching an optimal LLM. Another technique that has been around for some time now is DPO (Direct Preference Optimization). For DPO, the dataset should be in a specific format, i.e. it should contain a chosen response and a rejected response along with the instruction. DPO has shown great results in aligning the model while requiring less compute for the training process. To further improve the model's performance, people have recently adopted SFT followed by DPO on the same model. This combination of SFT+DPO has proved to be quite effective, but at the same time it requires more compute resources.
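For readers who want to see what that standard SFT recipe looks like in code, here is a minimal, hedged sketch: the model, dataset, and hyperparameters are illustrative (not from this post), and argument names can differ slightly between TRL versions.

```python
# Minimal sketch of the common SFT recipe: load the model in 4-bit,
# attach a LoRA adapter, and fine-tune with TRL's SFTTrainer.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTTrainer

model_id = "microsoft/phi-2"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM"
)
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",  # column holding the raw training text
    max_seq_length=1024,
    tokenizer=tokenizer,
)
trainer.train()
```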
What if I tell you there is another, better fine-tuning technique that can replace SFT+DPO and has shown promising results? I am referring to ORPO (Odds Ratio Preference Optimization). The main highlight is its loss function: it adds an odds ratio-based penalty to the conventional negative log-likelihood (NLL) loss to differentiate the generation styles between favored and disfavored responses.
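Written out (following the formulation in the ORPO paper), the objective adds an odds-ratio term to the standard SFT loss, weighted by a hyperparameter $\lambda$:

$$
\mathcal{L}_{ORPO} = \mathbb{E}_{(x, y_w, y_l)}\left[\mathcal{L}_{SFT} + \lambda \cdot \mathcal{L}_{OR}\right],
\quad
\mathcal{L}_{OR} = -\log \sigma\!\left(\log \frac{\text{odds}_\theta(y_w \mid x)}{\text{odds}_\theta(y_l \mid x)}\right),
\quad
\text{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)},
$$

where $y_w$ and $y_l$ are the chosen and rejected responses for prompt $x$.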
Can ORPO redefine how we train and align LLMs for RLHF?
State-of-the-art LLMs followed the process of Base Model → Supervised Fine-tuning → RLHF (PPO/DPO). This is very resource-intensive and complex. Odds Ratio Preference Optimization (ORPO) proposes a new method to train LLMs by combining SFT and alignment into a new objective (loss function), achieving state-of-the-art results. ORPO not only reduces the cost of training but also outperforms first fine-tuning the model and then doing RLHF (DPO) on the fine-tuned version. ORPO does not require a reference model, unlike RLHF and DPO. In that sense, ORPO is computationally more efficient than RLHF and DPO in two respects (a minimal sketch of the loss follows the list below):
- Memory allocation
- Fewer FLOPs per batch
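To make the reference-model point concrete, here is a minimal PyTorch sketch of the ORPO objective as I read it from the paper (a simplification, not the authors' code). Only the policy model's log-probabilities appear; there is no reference model to keep in memory or run extra forward passes through:

```python
import torch
import torch.nn.functional as F

def odds_ratio_loss(chosen_logps, rejected_logps, nll_loss, lam=0.1):
    """Sketch of the ORPO objective for one batch.

    chosen_logps / rejected_logps: average per-token log-probabilities of the
    chosen and rejected responses under the policy model only (no reference
    model is needed, unlike DPO).
    nll_loss: the usual SFT negative log-likelihood on the chosen response.
    lam: weight of the odds-ratio term (illustrative value).
    """
    # log(odds) = log p - log(1 - p); log1p(-exp(log p)) is a stable log(1 - p).
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # Penalize the model when the odds of the rejected response approach
    # (or exceed) the odds of the chosen one.
    or_term = -F.logsigmoid(log_odds_chosen - log_odds_rejected)
    return nll_loss + lam * or_term.mean()
```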
So, in my opinion, the answer to the above question is most probably a 'Yes'. It can certainly influence the way we train our models in the future, and it may have an impact on future research on fine-tuning LLMs.
ORPO details
ORPO outperforms SFT and SFT+DPO on Phi-2, Llama 2, and Mistral. Mistral fine-tuned with ORPO achieves 12.20% on AlpacaEval 2.0, 66.19% on IFEval, and 7.32 on MT-Bench, on par with Zephyr Beta.
The results from the ORPO paper are impressive, and to verify them, I decided to try ORPO out on Phi-2 with Argilla's dpo-mix-7k dataset. Some results from the paper are shown below.

The reason for choosing Phi-2 is that it shows a remarkable amount of improvement with this technique compared to SFT+DPO.
Training process
- For implementing ORPO, we require a dataset in DPO format, i.e. it should have chosen and rejected responses. For this experiment, we will opt for Argilla's dpo-mix-7k preference dataset.
- Make sure the dataset doesn't contain instances where the chosen and rejected responses are the same, or where one is empty (see the filtering sketch after this list).
- Select a pre-trained LLM (e.g., Llama-2, Mistral). In this case, I have selected Phi-2 as the base model.
- Train the base model with the ORPO objective on the preference dataset.
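Here is a hedged sketch of that data-preparation step. The column layout assumed below (chosen and rejected stored as lists of chat messages) is my reading of the dpo-mix-7k dataset card, so check it before running.

```python
# Load argilla/dpo-mix-7k and drop pairs where the chosen and rejected
# responses are identical or one of them is empty.
from datasets import load_dataset

dataset = load_dataset("argilla/dpo-mix-7k", split="train")

def last_turn(messages):
    # chosen/rejected are stored as lists of chat messages; the final
    # assistant message is the response being compared.
    return messages[-1]["content"].strip() if messages else ""

def keep(example):
    chosen = last_turn(example["chosen"])
    rejected = last_turn(example["rejected"])
    return bool(chosen) and bool(rejected) and chosen != rejected

dataset = dataset.filter(keep)
print(f"{len(dataset)} pairs kept")
```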
There is no extra SFT step; the ORPO objective is applied directly to the base model. The model was trained for 1 epoch on 1x A40 for almost 6 hours. The training parameters used are listed below, followed by a conceptual sketch of a single training step:
- Learning Rate: 5e-6
- Warmup Steps: 100
- Model Name: microsoft/phi-2
- Data Name: argilla/dpo-mix-7k
- Number of Training Epochs: 1
- Maximum Length of Prompt: 128
- Maximum Length of Response: 2048
- Per Device Training Batch Size: 4
- Per Device Evaluation Batch Size: 4
- Gradient Accumulation Steps: 1
- Number of Processes: 1
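To show how these pieces fit together, here is a hedged, conceptual sketch of one ORPO training step building on the odds_ratio_loss function above; it is my reconstruction, not the exact script used for this run. In practice, prompts are truncated to 128 tokens and responses to 2048, matching the settings above.

```python
import torch

def response_logps(model, input_ids, attention_mask, labels):
    """Average per-token log-prob of the response tokens (labels == -100 masks the prompt)."""
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits[:, :-1, :]
    labels = labels[:, 1:].clone()
    mask = labels != -100
    labels[~mask] = 0  # placeholder index for masked positions
    token_logps = torch.gather(logits.log_softmax(-1), 2, labels.unsqueeze(-1)).squeeze(-1)
    return (token_logps * mask).sum(-1) / mask.sum(-1)

def orpo_training_step(model, batch, lam=0.1):
    # Two forward passes per example: prompt+chosen and prompt+rejected.
    chosen_logps = response_logps(model, batch["chosen_input_ids"],
                                  batch["chosen_attention_mask"], batch["chosen_labels"])
    rejected_logps = response_logps(model, batch["rejected_input_ids"],
                                    batch["rejected_attention_mask"], batch["rejected_labels"])
    # The SFT (NLL) term is computed on the chosen response only.
    nll = -chosen_logps.mean()
    return odds_ratio_loss(chosen_logps, rejected_logps, nll, lam=lam)
```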
The training process has also been logged to Weights and Biases. Some of the graphs are shown below:

LazyORPO
LazyORPO is an automated tool to train your model with ORPO, a new technique that replaces SFT+DPO. I gave ORPO a shot with Phi-2 and Argilla's dpo-mix-7k, yielding Phi2-pro.
ORPO is not yet included in HF's TRL, so to make the training phase much easier, I have made a notebook that automates the entire training process with ORPO. Just specify the base model, dataset, number of epochs, and learning rate to kick off training. One thing to note is that ORPO requires more VRAM: I was not able to fit an 8B Gemma model on an A40 with 48GB of VRAM. So, do your calculations accordingly.
A Colab notebook is available for you to try it out. You can access GPUs using RunPod.

LazyORPO:
https://colab.research.google.com/drive/19ci5XIcJDxDVPY2xC1ftZ5z1kc2ah_rx?usp=sharing
Tags: AI, LLM, Machine Learning, NLP, Deep Learning, Queryloop