Exploring S1: Experiments and Findings
A detailed analysis of S1, an open-weight language model, with experiments across multiple benchmarks including GPQA, AIME25, and OpenAI Math.
Comparison of various positional embeddings.
When processing sequences such as text, ordering information is clearly critical. To incorporate this information rather than treat sequences as sets, encoding position is vital. Positional encoding achieves this by assigning an embedding vector to each position and adding it to the corresponding token representation. Many positional encoding techniques have been introduced: Absolute-PE counts tokens from the start of a sequence, while Relative-PE counts backward from the current token. We will discuss some of the more advanced position encoding methods: RoPE and its variants (Linear, NTK, YaRN), and CoPE. A minimal RoPE sketch is shown below.
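To make the idea concrete, here is a minimal sketch of Rotary Position Embedding (RoPE) in PyTorch. The tensor shapes, the pairing of dimensions, and the base of 10000 follow the common formulation; this is an illustration under those assumptions, not any specific library's implementation.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary embeddings to x of shape (seq_len, dim), dim even."""
    seq_len, dim = x.shape
    # One rotation frequency per pair of dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    pos = torch.arange(seq_len).float()
    angles = torch.outer(pos, inv_freq)          # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]          # pair up dimensions
    # Rotate each (x1, x2) pair by its position-dependent angle.
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(16, 64)    # 16 positions, head dim 64
q_rot = rope(q)            # queries now carry position-dependent rotations
```

Because the rotation angle depends only on position, the dot product between two rotated vectors depends on their relative offset, which is what makes RoPE a relative scheme in practice.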
Comparison of DeepSeek's new Multi-Head Latent Attention with MHA, MQA, and GQA.
In Transformer decoders, the attention of a token depends only on the preceding tokens, so instead of recalculating the previous context at every step, its Keys and Values are cached. This can significantly speed up inference but may impose an expensive memory overhead as the sequence length and the model dimensions grow. In this context, multiple attention mechanisms have been introduced: Multi-Head Attention, Multi-Query Attention, Grouped-Query Attention, and Multi-Head Latent Attention. A minimal caching sketch is shown below.
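The following is a minimal sketch of KV caching during greedy decoding. The `model` callable, its `past_key_values` argument, and the returned cache are assumptions standing in for an actual decoder; real engines manage the cache inside their attention kernels.

```python
import torch

def decode(model, input_ids: torch.Tensor, max_new_tokens: int = 32):
    past_kv = None                       # no cache on the first forward pass
    generated = input_ids
    for _ in range(max_new_tokens):
        # Once the cache exists, only the newest token needs processing.
        step_input = generated if past_kv is None else generated[:, -1:]
        logits, past_kv = model(step_input, past_key_values=past_kv)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)
    return generated
```

The cache grows linearly with sequence length and with the number of KV heads, which is exactly the cost that MQA, GQA, and Multi-Head Latent Attention try to reduce.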
Align your LLM with an approach that needs less memory and is faster than DPO
Aligning LLMs for optimal performance typically starts with Supervised Fine-Tuning (SFT). The standard practice involves loading the model in 4-bit mode and applying a configuration for LoRA (Low-Rank Adaptation) training. Direct Preference Optimization (DPO) is another prominent technique for optimizing models at lower cost, and it is commonly coupled with SFT to further improve performance, though the combination can be costly. Odds Ratio Preference Optimization (ORPO) replaces the SFT+DPO pipeline with a single step and improved performance by adding an odds ratio-based penalty to the conventional negative log-likelihood (NLL) loss, differentiating the generation styles of favored and disfavored responses. Another technique for more stable training and improved performance is CPO-SimPO. It aims to counter SFT's dependency on training-data quality, DPO's memory and speed inefficiency (since it keeps both a parametrized and a reference policy), and the generation of long but low-quality sequences. In this blog, I will introduce this technique in detail and further train Phi3-Mini-4K-Instruct with CPO-SimPO; a sketch of the odds-ratio penalty is shown below.
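As a rough illustration of the odds-ratio penalty, here is a sketch of an ORPO-style loss. The inputs `logp_chosen` and `logp_rejected` are assumed to be mean per-token log-probabilities of the favored and disfavored responses, and `lam` is an assumed penalty weight; the exact token-level terms differ in the paper and in trainer implementations.

```python
import torch
import torch.nn.functional as F

def orpo_loss(logp_chosen: torch.Tensor,
              logp_rejected: torch.Tensor,
              lam: float = 0.1) -> torch.Tensor:
    # log odds(y|x) = log( P(y|x) / (1 - P(y|x)) ), computed in log space;
    # log-probs are strictly negative, so log1p(-exp(.)) stays finite.
    log_odds_chosen = logp_chosen - torch.log1p(-torch.exp(logp_chosen))
    log_odds_rejected = logp_rejected - torch.log1p(-torch.exp(logp_rejected))
    # Penalty pushes the odds of the favored response above the disfavored one.
    penalty = -F.logsigmoid(log_odds_chosen - log_odds_rejected)
    nll = -logp_chosen          # conventional NLL on the chosen response
    return (nll + lam * penalty).mean()
```

Because the penalty is computed from the policy's own probabilities, no frozen reference model is needed, which is where the memory and speed savings over DPO come from.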
Benchmarking various LLM Inference Engines.
LLMs excel in text generation applications such as chat and code completion, showing high understanding and fluency. However, their large size also creates challenges for inference. Basic inference is slow because LLMs generate text token by token, requiring a new forward pass for each next token, and processing time grows as the input sequence grows. Additionally, LLMs have billions of parameters, making it difficult to store and manage all those weights in memory. Multiple frameworks and packages aim to optimize LLM inference and serving; in this blog, I'll use and compare the following inference engines: TensorRT-LLM, vLLM, LMDeploy, and MLC-LLM.
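For context on how such comparisons are scored, here is a minimal throughput-measurement sketch. The `generate` callable is a hypothetical stand-in for whichever engine is under test and is assumed to return the number of tokens it produced; real benchmarks also track latency percentiles and concurrency.

```python
import time

def measure_throughput(generate, prompts, runs: int = 3) -> float:
    """Return average decoded tokens per second across repeated runs."""
    total_tokens, total_time = 0, 0.0
    for _ in range(runs):
        for prompt in prompts:
            start = time.perf_counter()
            total_tokens += generate(prompt)      # engine-specific call
            total_time += time.perf_counter() - start
    return total_tokens / total_time              # tokens per second
```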
A comprehensive explanation of Andrej Karpathy's Micrograd implementation with mathematical concepts and object-oriented programming.
Neural Networks: Zero to Hero by Andrej Karpathy focuses on building neural networks from scratch, starting with the basics of backpropagation and advancing to modern deep neural networks like GPT.
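To give a flavor of what the walkthrough covers, here is a minimal sketch in the spirit of micrograd's Value class, simplified to a single operation; the real implementation supports more operators, but the pattern of recording parents and local gradients is the same.

```python
class Value:
    """A scalar that records its computation graph for backpropagation."""
    def __init__(self, data, _parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = _parents
        self._backward = lambda: None

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # Chain rule: d(out)/d(self) = other.data, d(out)/d(other) = self.data
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically order the graph, then apply the chain rule in reverse.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b = Value(2.0), Value(3.0)
c = a * b
c.backward()
print(a.grad, b.grad)   # 3.0 2.0
```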
Training 3 Llama models for comparison of Cosine Scheduled and Schedule-Free optimizer.
In machine learning, we continuously rely on intricate algorithms and techniques to train our models effectively.
What is a 1-bit LLM, and how do you train a 70M Llama-Bitnet?
Vanilla LLMs built upon the Transformer architecture typically operate in 16-bit precision (FP-16 or BF-16), so the major computation cost comes from floating-point matrix addition and multiplication operations...
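As a quick illustration of what "1-bit" (more precisely, 1.58-bit ternary) weights look like, here is a sketch of absmean quantization in the style of BitNet b1.58. The function name and per-tensor scaling are assumptions for illustration; training additionally relies on a straight-through estimator so the full-precision weights keep receiving gradients.

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Quantize weights to {-1, 0, +1} with a per-tensor absmean scale."""
    scale = w.abs().mean().clamp(min=eps)        # per-tensor scale
    w_q = (w / scale).round().clamp(-1, 1)       # values in {-1, 0, +1}
    return w_q, scale

w = torch.randn(4, 4)
w_q, scale = ternary_quantize(w)
w_dequant = w_q * scale                          # approximate reconstruction
```

With ternary weights, matrix multiplication reduces to additions and subtractions of activations, which is where the large compute and memory savings come from.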
Train Phi-2 with ORPO using LazyOrpo
Before jumping into ORPO, I am going to assume that you are well-acquainted with the process of fine-tuning LLMs for optimal performance. One of the most common techniques used for fine-tuning is Supervised Fine-Tuning (SFT)...
Train 70–120B LLM on 4xA100s and 2xRTX3090s (Consumer-grade GPUs)
I have been working with bigger models like Mixtral 8x7B, Qwen-120B, and Miqu-70B recently. The most important consideration when playing with bigger models is the amount of compute they require during training...
Also releasing Gemma-7B-Openhermes and Gemma-2B-Openhermes
Google has been in the LLM space for quite some time now, yet Gemma is their first open LLM. Its release has stirred the community, and everyone is excited to try it out. I am no exception...
Benchmarking emotional intelligence, code generation, text summarization, and narrative composition.
Small Language Models (SLMs) have been the talk of the town for some time now. Different models are being released almost every day, aiming to achieve results on par with Large Language Models (LLMs). However, in terms of computational and memory cost, SLMs are already ahead...