Exploring S1: Experiments and Findings

Zain ul Abideen
March 12, 2025
15 min read

A detailed analysis of S1, an open-weight language model, with experiments across multiple benchmarks including GPQA, AIME25, and OpenAI Math.


Introduction

S1 is an open-weight language model from the SimpleScaling project, designed for efficiency and performance across multiple NLP benchmarks. The model's architecture and training methodology aim to optimize inference speed while maintaining strong accuracy. The official GitHub repository provides insights into its design and capabilities.

To evaluate S1's performance, I conducted experiments on various datasets, including GPQA Diamond OpenAI, AIME25 No Figures, and OpenAI Math, using the vLLM inference framework. The goal was to measure exact match accuracy and observe how different generation strategies impact results.

Experiment Setup and Challenges

I ran the experiments using s1.1-32B (pretrained checkpoint) with float16 precision and tensor parallelism across eight GPUs. The evaluation was performed with lm_eval using different "thinking" strategies (Delay, Halt, and Wait) to analyze how they influence response accuracy.
git clone https://github.com/simplescaling/s1.git
cd s1
pip install -r requirements.txt
cd eval/lm-evaluation-harness
pip install -e .[math,vllm]
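
The exact launch command is not shown in the result logs below, but the model args and gen_kwargs are. A minimal sketch of one run, assuming the harness's standard CLI; the output path and --log_samples flag are illustrative, not confirmed settings:

# One GPQA run with the Delay strategy; model_args and gen_kwargs mirror the logs below.
lm_eval --model vllm \
  --model_args pretrained=simplescaling/s1.1-32B,dtype=float16,tensor_parallel_size=8 \
  --tasks gpqa_diamond_openai \
  --batch_size auto \
  --gen_kwargs max_gen_toks=2048,max_tokens_thinking=1400,thinking_n_ignore=1,thinking_n_ignore_str=Delay \
  --log_samples --output_path results/gpqa_delay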

Inference Process

As part of the inference setup, I:
  • Installed vLLM and transformers for model execution.
  • Loaded s1.1-32B with tensor parallelism (tensor_parallel_size=8, as reported in the result logs below).
  • Used a 32K max token limit to leave room for extended reasoning.
  • Applied custom sampling strategies, adjusting stop-token handling and temperature settings.
  • Tried different synonyms for the thinking pause (Wait, Pause, Hold, Suspend) to evaluate the model's response control; a sweep over these words is sketched after this list.
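A hedged sketch of that sweep, reusing the invocation above and assuming thinking_n_ignore_str accepts arbitrary strings (the result logs only confirm Delay, Halt, and Wait); output paths are illustrative:

# Sweep the "thinking pause" word across runs.
for WORD in Wait Pause Hold Suspend; do
  lm_eval --model vllm \
    --model_args pretrained=simplescaling/s1.1-32B,dtype=float16,tensor_parallel_size=8 \
    --tasks gpqa_diamond_openai \
    --batch_size auto \
    --gen_kwargs "max_gen_toks=2048,max_tokens_thinking=1400,thinking_n_ignore=1,thinking_n_ignore_str=${WORD}" \
    --log_samples --output_path "results/gpqa_${WORD}"
done
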
Challenges faced:
  • Computational Demand: Running a 32B model required significant GPU memory and processing time (roughly 30 GB of VRAM per run); the relevant vLLM memory knobs are sketched after this list.
  • Inference Speed: Latency varied across tasks, requiring tuning of batch sizes and execution parallelism.
  • Output Consistency: Stop-token handling had to be adjusted to prevent incomplete or overly verbose responses.
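
vLLM exposes engine arguments that the harness forwards through --model_args, which is the usual lever for the memory and batching issues above. A hedged sketch; the specific values are illustrative, not the settings used in these runs:

# gpu_memory_utilization caps vLLM's memory preallocation; max_model_len bounds the KV cache.
lm_eval --model vllm \
  --model_args pretrained=simplescaling/s1.1-32B,dtype=float16,tensor_parallel_size=8,gpu_memory_utilization=0.90,max_model_len=32768 \
  --tasks openai_math \
  --batch_size auto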

Evaluation Results

Here's a summary of the exact match scores across different datasets and strategies:

GPQA Diamond OpenAI

  • Delay: 0.5758
Saving per-sample results for: gpqa_diamond_openai
vllm (pretrained=simplescaling/s1.1-32B,dtype=float16,tensor_parallel_size=8), gen_kwargs: (max_gen_toks=2048,max_tokens_thinking=1400,thinking_n_ignore=1,thinking_n_ignore_str=Delay), limit: None, num_fewshot: None, batch_size: auto
|       Tasks       |Version|Filter|n-shot|      Metric     |   | Value |   |Stderr|
|-------------------|------:|------|-----:|-----------------|---|------:|---|------|
|gpqa_diamond_openai|      1|none  |     0|exact_match      |↑  | 0.5758|±  |   N/A|
|                   |       |none  |     0|extracted_answers|↑  |-1.0000|±  |   N/A|
  • Halt: 0.5808
Saving per-sample results for: gpqa_diamond_openai
vllm (pretrained=simplescaling/s1.1-32B,dtype=float16,tensor_parallel_size=8), gen_kwargs: (max_gen_toks=2048,max_tokens_thinking=1400,thinking_n_ignore=1,thinking_n_ignore_str=Halt), limit: None, num_fewshot: None, batch_size: auto
|       Tasks       |Version|Filter|n-shot|      Metric     |   | Value |   |Stderr|
|-------------------|------:|------|-----:|-----------------|---|------:|---|------|
|gpqa_diamond_openai|      1|none  |     0|exact_match      |↑  | 0.5808|±  |   N/A|
|                   |       |none  |     0|extracted_answers|↑  |-1.0000|±  |   N/A|
  • Wait: 0.5808
Saving per-sample results for: gpqa_diamond_openai
vllm (pretrained=simplescaling/s1.1-32B,dtype=float16,tensor_parallel_size=8), gen_kwargs: (max_gen_toks=2048,max_tokens_thinking=1400,thinking_n_ignore=1,thinking_n_ignore_str=Wait), limit: None, num_fewshot: None, batch_size: auto
|       Tasks       |Version|Filter|n-shot|      Metric     |   | Value |   |Stderr|
|-------------------|------:|------|-----:|-----------------|---|------:|---|------|
|gpqa_diamond_openai|      1|none  |     0|exact_match      |↑  | 0.5808|±  |   N/A|
|                   |       |none  |     0|extracted_answers|↑  |-1.0000|±  |   N/A|

AIME25 No Figures

  • Delay: 0.2667
Saving per-sample results for: aime25_nofigures
vllm (pretrained=simplescaling/s1.1-32B,dtype=float16,tensor_parallel_size=8), gen_kwargs: (max_gen_toks=2048,max_tokens_thinking=auto,thinking_n_ignore=1,thinking_n_ignore_str=Delay), limit: None, num_fewshot: None, batch_size: auto
|     Tasks      |Version|Filter|n-shot|      Metric     |   | Value |   |Stderr|
|----------------|------:|------|-----:|-----------------|---|------:|---|------|
|aime25_nofigures|      1|none  |     0|exact_match      |↑  | 0.2667|±  |   N/A|
|                |       |none  |     0|extracted_answers|↑  |-1.0000|±  |   N/A|
  • Halt: 0.2667
Saving per-sample results for: aime25_nofigures
vllm (pretrained=simplescaling/s1.1-32B,dtype=float16,tensor_parallel_size=8), gen_kwargs: (max_gen_toks=2048,max_tokens_thinking=auto,thinking_n_ignore=1,thinking_n_ignore_str=Halt), limit: None, num_fewshot: None, batch_size: auto
|     Tasks      |Version|Filter|n-shot|      Metric     |   | Value |   |Stderr|
|----------------|------:|------|-----:|-----------------|---|------:|---|------|
|aime25_nofigures|      1|none  |     0|exact_match      |↑  | 0.2667|±  |   N/A|
|                |       |none  |     0|extracted_answers|↑  |-1.0000|±  |   N/A|
  • Wait: 0.2667
Saving per-sample results for: aime25_nofigures
vllm (pretrained=simplescaling/s1.1-32B,dtype=float16,tensor_parallel_size=8), gen_kwargs: (max_gen_toks=2048,max_tokens_thinking=auto,thinking_n_ignore=1,thinking_n_ignore_str=Wait), limit: None, num_fewshot: None, batch_size: auto
|     Tasks      |Version|Filter|n-shot|      Metric     |   | Value |   |Stderr|
|----------------|------:|------|-----:|-----------------|---|------:|---|------|
|aime25_nofigures|      1|none  |     0|exact_match      |↑  | 0.2667|±  |   N/A|
|                |       |none  |     0|extracted_answers|↑  |-1.0000|±  |   N/A|
  • (With a 16,000 token limit): 0.4
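A minimal sketch of how the 16K-budget run could be launched, assuming max_tokens_thinking accepts an explicit value (as in the 1400-token GPQA runs); the exact split between generation and thinking budget is an assumption:

# Hypothetical AIME25 re-run with a 16K token budget (flag values assumed).
lm_eval --model vllm \
  --model_args pretrained=simplescaling/s1.1-32B,dtype=float16,tensor_parallel_size=8 \
  --tasks aime25_nofigures \
  --batch_size auto \
  --gen_kwargs max_gen_toks=16000,max_tokens_thinking=16000,thinking_n_ignore=1,thinking_n_ignore_str=Wait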

OpenAI Math

  • Delay: 0.818
Saving per-sample results for: openai_math
vllm (pretrained=simplescaling/s1.1-32B,dtype=float16,tensor_parallel_size=8), gen_kwargs: (max_gen_toks=2048,max_tokens_thinking=auto,thinking_n_ignore=1,thinking_n_ignore_str=Delay), limit: None, num_fewshot: None, batch_size: auto
|   Tasks   |Version|Filter|n-shot|      Metric     |   |Value |   |Stderr|
|-----------|------:|------|-----:|-----------------|---|-----:|---|------|
|openai_math|      1|none  |     0|exact_match      |↑  | 0.818|±  |   N/A|
|           |       |none  |     0|extracted_answers|↑  |-1.000|±  |   N/A|
  • Halt: 0.816
Saving per-sample results for: openai_math
vllm (pretrained=simplescaling/s1.1-32B,dtype=float16,tensor_parallel_size=8), gen_kwargs: (max_gen_toks=2048,max_tokens_thinking=auto,thinking_n_ignore=1,thinking_n_ignore_str=Halt), limit: None, num_fewshot: None, batch_size: auto
|   Tasks   |Version|Filter|n-shot|      Metric     |   |Value |   |Stderr|
|-----------|------:|------|-----:|-----------------|---|-----:|---|------|
|openai_math|      1|none  |     0|exact_match      |↑  | 0.816|±  |   N/A|
|           |       |none  |     0|extracted_answers|↑  |-1.000|±  |   N/A|
  • Wait: 0.816
Saving per-sample results for: openai_math
vllm (pretrained=simplescaling/s1.1-32B,dtype=float16,tensor_parallel_size=8), gen_kwargs: (max_gen_toks=2048,max_tokens_thinking=auto,thinking_n_ignore=1,thinking_n_ignore_str=Wait), limit: None, num_fewshot: None, batch_size: auto
|   Tasks   |Version|Filter|n-shot|      Metric     |   |Value |   |Stderr|
|-----------|------:|------|-----:|-----------------|---|-----:|---|------|
|openai_math|      1|none  |     0|exact_match      |↑  | 0.816|±  |   N/A|
|           |       |none  |     0|extracted_answers|↑  |-1.000|±  |   N/A|
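
Each run saves per-sample results, so per-question outcomes can be inspected directly. A quick sketch, assuming --log_samples wrote JSONL files under results/ and that each line carries an exact_match field; both the paths and the field name are assumptions about the harness's output layout:

# Tally per-sample exact_match values for one run (paths and fields assumed).
jq -r '.exact_match' results/gpqa_Wait/samples_gpqa_diamond_openai_*.jsonl | sort | uniq -c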

Observations

  • For GPQA, Wait and Halt performed slightly better than Delay, though the differences were minimal.
  • AIME25 scores were consistently low (~0.27), but raising the max token limit to 16K improved the score to 0.4.
  • OpenAI Math yielded the highest accuracy (~0.82), with minimal variation across thinking strategies.

Future Directions

  • Optimization: Investigate dynamic token allocation to reduce computational overhead while maintaining accuracy.
  • Fine-Tuning: Apply domain-specific fine-tuning to boost AIME25 performance.
  • Enhanced Strategy Testing: Compare additional prompting techniques (e.g., chain-of-thought, few-shot learning) to analyze their impact.
  • Inference Speed: Optimize CUDA parallelism and reduce memory bottlenecks by tuning batch execution.
Tags: AI, Language Models, SimpleScaling, S1 Model, Benchmarking, Machine Learning, NLP, Queryloop