MathReasoner-Mini-1.5b

🚨 We recommend using this model for high-school-level math problems. It works best when the question is asked in English. We do not advise using it for other tasks, as this is an experimental release aimed at exploring the reasoning capabilities of small models.

πŸ“ Colab notebook for inference

Introduction

This is a reasoning model built on top of Qwen2.5-Math-1.5B-base and trained in three stages (SFT, DPO, and GRPO) to progressively improve mathematical reasoning with structured outputs on the GSM8K dataset, a benchmark targeting grade-school math problems.

Evaluation (GSM8K Pass@1, Zero-Shot)

Model                                              Pass@1 Accuracy (%)
Base Qwen2.5-Math-1.5B                             54
After SFT                                          67.5
After SFT + DPO                                    70
After SFT + DPO + GRPO (MathReasoner-Mini-1.5b)    ~82.1

Evaluation was run on the GSM8K test split with temperature=0.3 and top_p=1.0.

MathReasoner's pass@8 accuracy is 94.1%, showing that there is still room for improvement from scaling up RL.

The accuracies above take the structured output format into account: reasoning must be enclosed within <think> </think> tags and the numerical answer within <answer> </answer> tags.
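
For reference, below is a minimal, hypothetical sketch of such a format-strict check (this is not the exact evaluation script; extract_answer and is_correct are illustrative helpers):

import re

# Because the prompt template already opens a <think> tag, a well-formatted
# completion looks like: "reasoning ... </think> <answer> 72 </answer>".
ANSWER_RE = re.compile(r"</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def extract_answer(completion):
    """Return the numeric answer if the completion is well formatted, else None."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return None  # format violation -> scored as incorrect
    raw = match.group(1).strip().replace(",", "")
    number = re.search(r"-?\d+(?:\.\d+)?", raw)
    return float(number.group()) if number else None

def is_correct(completion, gold):
    pred = extract_answer(completion)
    return pred is not None and abs(pred - float(gold)) < 1e-6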

Training Stages

Stage 1 - Supervised Fine-Tuning (SFT)

Checkpoint: arubittu/Qwen-2.5_1.5b_MATH_GSM8K_SFT10

  • Dataset: curated GSM8K subset with self-verified generations
  • Epochs: 10
  • LR: 3e-6
  • Batch size: 4
  • Gradient accumulation: 4
  • Only correct and well-formatted CoT samples were used, to minimize model entropy
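
As a rough sketch (not the exact training script), this stage maps onto TRL's SFTTrainer roughly as follows; the dataset file name, the "text" column assumption, and the bf16 setting are placeholders:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

base_id = "Qwen/Qwen2.5-Math-1.5B"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Placeholder file: curated GSM8K rows whose "text" field holds the prompt
# plus a correct, well-formatted <think>/<answer> completion.
train_ds = load_dataset("json", data_files="gsm8k_sft_curated.jsonl", split="train")

config = SFTConfig(
    output_dir="qwen2.5-math-1.5b-gsm8k-sft",
    num_train_epochs=10,              # epochs reported above
    learning_rate=3e-6,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    bf16=True,                        # assumption: precision not stated in the card
)

trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=train_ds,
    processing_class=tokenizer,       # "tokenizer=" in older TRL versions
)
trainer.train()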

Stage 2 - Direct Preference Optimization (DPO)

Checkpoint: arubittu/Qwen-2.5_1.5b_MATH_GSM8K_SFT10_DPO3

  • Dataset: ~1,000 preference pairs
  • Mostly hard pairs (correct vs. incorrect)
  • Some soft preferences (shorter correct CoT preferred)
  • For each GSM8K problem, 4 samples were generated → chosen = correct, rejected = incorrect
  • Epochs: 3
  • β = 0.1, LR = 3e-6
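
A minimal sketch of the corresponding TRL DPOTrainer setup, assuming the preference pairs are stored with prompt / chosen / rejected columns; the file path, batch size, and precision are assumptions:

from datasets import load_dataset
from transformers import AutoTokenizer
from trl import DPOConfig, DPOTrainer

sft_checkpoint = "arubittu/Qwen-2.5_1.5b_MATH_GSM8K_SFT10"
tokenizer = AutoTokenizer.from_pretrained(sft_checkpoint)

# Placeholder file: ~1,000 rows with "prompt", "chosen", "rejected" columns,
# built by sampling 4 completions per problem and pairing correct vs. incorrect ones.
pref_ds = load_dataset("json", data_files="gsm8k_dpo_pairs.jsonl", split="train")

config = DPOConfig(
    output_dir="qwen2.5-math-1.5b-gsm8k-sft-dpo",
    beta=0.1,                        # β reported above
    learning_rate=3e-6,
    num_train_epochs=3,
    per_device_train_batch_size=4,   # assumption: not stated in the card
    bf16=True,                       # assumption
)

trainer = DPOTrainer(
    model=sft_checkpoint,            # start from the SFT checkpoint
    args=config,
    train_dataset=pref_ds,
    processing_class=tokenizer,
)
trainer.train()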

Stage 3 - GRPO Reinforcement Learning

The model was further trained with GRPO on the GSM8K train split.

  • Steps: 400
  • Loss type: DAPO
  • Rollouts per prompt: 4
  • Gradient accumulation: 8
  • Custom reward: format strictness + correctness
  • vLLM-enabled rollouts with the TRL trainer
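
A hedged sketch of how this stage could look with TRL's GRPOTrainer; the reward values, dataset file, batch size, and precision are assumptions, and the reward mirrors the tag check described in the evaluation section:

import re

from datasets import load_dataset
from transformers import AutoTokenizer
from trl import GRPOConfig, GRPOTrainer

dpo_checkpoint = "arubittu/Qwen-2.5_1.5b_MATH_GSM8K_SFT10_DPO3"
tokenizer = AutoTokenizer.from_pretrained(dpo_checkpoint)

ANSWER_RE = re.compile(r"</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def format_and_correctness_reward(completions, answer, **kwargs):
    """Hypothetical reward: small bonus for correct tags, larger bonus for a correct answer."""
    rewards = []
    for completion, gold in zip(completions, answer):
        match = ANSWER_RE.search(completion)
        if match is None:
            rewards.append(0.0)                # broken format
            continue
        reward = 0.2                           # format bonus (value is an assumption)
        if match.group(1).strip().replace(",", "") == str(gold).strip():
            reward += 1.0                      # correctness bonus (value is an assumption)
        rewards.append(reward)
    return rewards

# Placeholder file: GSM8K train problems preprocessed into "prompt" (via the
# prompt template below) and "answer" (final numeric answer) columns.
train_ds = load_dataset("json", data_files="gsm8k_grpo_prompts.jsonl", split="train")

config = GRPOConfig(
    output_dir="mathreasoner-mini-1.5b-grpo",
    max_steps=400,
    num_generations=4,                 # rollouts per prompt
    per_device_train_batch_size=4,     # assumption: must be compatible with num_generations
    gradient_accumulation_steps=8,
    loss_type="dapo",                  # as reported; requires a TRL version with DAPO support
    use_vllm=True,                     # vLLM-backed rollouts
    bf16=True,                         # assumption
)

trainer = GRPOTrainer(
    model=dpo_checkpoint,
    reward_funcs=format_and_correctness_reward,
    args=config,
    train_dataset=train_ds,
    processing_class=tokenizer,
)
trainer.train()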

Prompt Template

def prompt_input(question):
  prompt = f'''A conversation between User and Assistant. The User asks a question, and the Assistant solves it. The Assistant first thinks about the reasoning process in the mind and then provides the User with the answer. The reasoning process is enclosed within <think> </think> and answer is enclosed within <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>.
  User: {question}
  Assistant: <think>'''
  return prompt

Loading the model with Hugging Face Transformers

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "arubittu/MathReasoner-Mini-1.5b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
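
A short usage example combining the loading code above with the prompt_input helper from the prompt template section; the question is a GSM8K-style sample, and the decoding settings follow the evaluation settings reported above:

question = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"
inputs = tokenizer(prompt_input(question), return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.3,
    top_p=1.0,
)
# Decode only the newly generated tokens (the completion after "Assistant: <think>").
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(completion)  # e.g. "... 48 + 24 = 72 </think> <answer> 72 </answer>"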

Scope for Improvement

  • Train with more diverse math data for better generalization. All three stages currently use the GSM8K train set, either in full or as subsets; adding more diverse datasets should further improve results and let the model solve harder problems. This path can be pursued in the future if more compute resources become available.
  • Optimize the current training with curriculum learning over temperature and rollout scaling, especially in the RL phase, as done by PhysicsWallahAI/Aryabhata-1.0.
  • Explore methods such as model merging and on-policy distillation from larger models.

Contact

If you find issues or want improvements, feel free to open an issue or discussion on the Hugging Face page.
