Queryloop

Latest from Queryloop

Stay updated with our latest research findings, product developments, and insights into AI optimization

Queryloop Team
March 3, 2025
5 min read
Product

Why Building Production-Grade RAG Applications Is So Hard

Learn why creating demo RAG applications is easy, but building production-grade systems is exponentially harder, and how Queryloop solves these challenges.

Creating a demo for Retrieval Augmented Generation (RAG) is easy, but building a production-grade app is 10x harder—if not more. For every blog or tutorial claiming you can launch a RAG app in under an hour, there are hundreds discussing the complexities of building LLM and RAG systems that reliably deliver acceptable accuracy, latency, and cost.

RAG
LLM
Optimization
Production
AI Applications
Queryloop
Zain ul Abideen
July 7, 2024
6 min read
Research

Align Phi3 with CPO-SimPO

Align your LLM with an approach that is more memory- and speed-efficient than DPO

Aligning LLMs for optimal performance typically starts with Supervised Fine-Tuning (SFT). The standard practice is to load the model in 4-bit mode and apply a configuration for LoRA (Low-Rank Adaptation) training. Direct Preference Optimization (DPO) is another prominent technique for optimizing models at lower cost. SFT and DPO are commonly coupled to further improve model performance, but this pipeline can be costly. Odds Ratio Preference Optimization (ORPO) collapses SFT+DPO into a single step with enhanced performance by adding an odds ratio-based penalty to the conventional negative log-likelihood (NLL) loss, differentiating the generation styles of favored and disfavored responses. Another technique for more stable training and improved performance is CPO-SimPO. It aims to counter SFT's dependency on training-data quality, DPO's memory and speed inefficiency (when maintaining both a parametrized and a reference policy), and the tendency to generate long but low-quality sequences. In this blog, I will introduce this technique in detail and train Phi3-Mini-4K-Instruct with CPO-SimPO.
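To make the combined objective concrete, here is a minimal, hedged sketch of a CPO-SimPO-style loss on a single preference pair. It assumes per-token log-probabilities from the policy only (no reference model, as in SimPO/CPO), combines SimPO's length-normalized reward with a target margin `gamma`, and adds CPO's NLL regularizer on the chosen response. The function name, signature, and default hyperparameters are illustrative, not the actual implementation used in the CPO-SimPO repository or TRL.

```python
import math

def cpo_simpo_loss(chosen_logps, rejected_logps,
                   beta=2.0, gamma=0.5, nll_weight=1.0):
    """Sketch of a CPO-SimPO-style objective for one preference pair.

    chosen_logps / rejected_logps: lists of per-token log-probabilities
    under the current policy for the favored and disfavored responses.
    No reference policy is needed, which is the memory saving over DPO.
    """
    # SimPO-style implicit rewards: length-normalized policy log-probs.
    r_chosen = beta * sum(chosen_logps) / len(chosen_logps)
    r_rejected = beta * sum(rejected_logps) / len(rejected_logps)

    # Bradley-Terry preference term with SimPO's target margin gamma:
    # -log sigmoid(r_chosen - r_rejected - gamma).
    margin = r_chosen - r_rejected - gamma
    preference_loss = math.log(1.0 + math.exp(-margin))

    # CPO-style NLL regularizer on the chosen response, discouraging
    # drift away from high-quality favored generations.
    nll_loss = -sum(chosen_logps) / len(chosen_logps)

    return preference_loss + nll_weight * nll_loss
```

As expected, the loss is small when the favored response is much more likely than the disfavored one, and grows when the ranking is inverted; the length normalization is what penalizes long but low-probability (low-quality) chosen sequences.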

AI
Machine Learning
Deep Learning
Optimization
CPO
SimPO