Align Phi3 with CPO-SimPO
Align your LLM with a more memory- and speed-efficient approach than DPO
Aligning LLMs for optimal performance typically starts with Supervised Fine-Tuning (SFT). The standard practice is to load the model in 4-bit and apply a configuration for LoRA (Low-Rank Adaptation) training. Direct Preference Optimization (DPO) is another prominent technique for optimizing models at lower cost; it is commonly coupled with SFT (SFT+DPO) to further improve model performance, but this two-stage pipeline can be expensive. Odds Ratio Preference Optimization (ORPO) collapses SFT+DPO into a single step with enhanced performance by adding an odds ratio-based penalty to the conventional negative log-likelihood (NLL) loss to differentiate between favored and disfavored responses. Another technique for more stable training and improved performance is CPO-SimPO. It aims to counter SFT's dependence on training data quality, DPO's memory and speed inefficiency (it must keep both a parametrized policy and a reference policy), and the tendency to generate long but low-quality sequences. In this blog, I will introduce this technique in detail and train Phi3-Mini-4K-Instruct with CPO-SimPO.
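For context, here is a minimal sketch of what the 4-bit + LoRA setup mentioned above typically looks like with transformers, bitsandbytes, and peft. The quantization settings and the LoRA hyperparameters (r, lora_alpha, target_modules) are illustrative assumptions, not the exact configuration used later in this post.

```python
# Sketch: load Phi-3-Mini-4K-Instruct in 4-bit and attach a LoRA adapter.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "microsoft/Phi-3-mini-4k-instruct"

# 4-bit NF4 quantization keeps the frozen base weights small during fine-tuning.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA trains a small set of low-rank adapter weights on top of the 4-bit base.
# Hyperparameters and target modules below are illustrative choices.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["qkv_proj", "o_proj", "gate_up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```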
What is CPO-SimPO?
Contrastive Preference Optimization — CPO
Simple Preference Optimization — SimPO
CPO-SimPO
CPO-SimPO — Objective
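Putting the two pieces together, my reading of CPO-SimPO is that it optimizes a SimPO-style, length-normalized preference loss plus CPO's NLL term on the preferred response, with no reference policy involved. As a sketch (notation assumed from the SimPO and CPO papers: beta is the reward scaling, gamma the target reward margin, and alpha the weight on the NLL regularizer):

$$
\mathcal{L}_{\text{CPO-SimPO}}(\pi_\theta) =
-\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
\left[
\log \sigma\!\left(
\frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x)
- \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x)
- \gamma
\right)
\right]
- \alpha\,\mathbb{E}_{(x, y_w) \sim \mathcal{D}}
\left[ \log \pi_\theta(y_w \mid x) \right]
$$

Here $y_w$ and $y_l$ are the chosen and rejected responses, and $|y|$ is the response length in tokens. The length normalization and the margin $\gamma$ come from SimPO and discourage long, low-quality generations, while the NLL term from CPO keeps the policy anchored to the preferred responses without needing a separate reference model in memory.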
Setting up dependencies
- mkdir -p ~/miniconda3
- wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
- bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
- rm -rf ~/miniconda3/miniconda.sh
- ~/miniconda3/bin/conda init bash
- ~/miniconda3/bin/conda init zsh