MathReasoner-Mini-1.5b
We recommend using this model for high-school-level math problems. It works better when the question is asked in English. We do not advise using it for other tasks, as this is an experimental release aimed at exploring the reasoning capabilities of small models.
Colab notebook for inference
Introduction
This is a reasoning model built on top of Qwen2.5-Math-1.5B-base and trained in three stages (SFT, DPO, and GRPO) to progressively improve mathematical reasoning with structured outputs on the GSM8K dataset, a benchmark of grade-school math problems.
Evaluation (GSM8K Pass@1, Zero-Shot)
| Model | Pass@1 Accuracy |
|---|---|
| Base Qwen2.5-Math-1.5B | 54% |
| After SFT | 67.5% |
| After SFT + DPO | 70% |
| After SFT + DPO + GRPO (MathReasoner-Mini-1.5b) | ~82.1% |
Evaluation was run on the GSM8K test split with temperature=0.3 and top_p=1.0.
MathReasoner's pass@8 accuracy is 94.1%, showing that further gains are still possible by scaling RL.
The accuracies above take the structured output format into account: reasoning must be enclosed within <think> tags and the numerical answer within <answer> tags.
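For reference, a minimal sketch of how such format-aware scoring can be done (the exact evaluation script is not published with this card, so the function names and regex below are illustrative assumptions):

```python
import re

def extract_answer(completion: str):
    """Pull the final answer out of <answer> ... </answer>; return None
    for badly formatted completions so they are scored as incorrect."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return None
    # Normalize commas/whitespace so "1,000" and " 1000 " compare equal.
    return match.group(1).strip().replace(",", "")

def is_correct(completion: str, gold_answer: str) -> bool:
    predicted = extract_answer(completion)
    return predicted is not None and predicted == gold_answer.strip().replace(",", "")
```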
Training Stages
Stage 1 – Supervised Fine-Tuning (SFT)
Checkpoint: arubittu/Qwen-2.5_1.5b_MATH_GSM8K_SFT10
- Dataset: curated GSM8K subset with self-verified generations
- Epochs: 10
- LR: 3e-6
- Batch size: 4
- Gradient accumulation: 4
- Only correct & well-formatted CoT samples were used, to minimize model entropy (see the filter sketch after this list)
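A minimal sketch of the kind of filter this implies (the actual curation code is not published with this card, so the regex and function name are assumptions):

```python
import re

# Assumed format: reasoning closed by </think>, then the answer in <answer> tags.
FORMAT_RE = re.compile(r"</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL)

def keep_for_sft(sample: str, gold_answer: str) -> bool:
    """Keep a generated CoT sample only if it is well formatted and its
    extracted answer matches the gold GSM8K answer."""
    match = FORMAT_RE.search(sample)
    if match is None:
        return False
    predicted = match.group(1).strip().replace(",", "")
    return predicted == gold_answer.strip().replace(",", "")
```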
Stage 2 – Direct Preference Optimization (DPO)
Checkpoint: arubittu/Qwen-2.5_1.5b_MATH_GSM8K_SFT10_DPO3
- Dataset: ~1,000 preference pairs
- Mostly hard pairs (correct vs incorrect)
- Some soft preferences (shorter correct CoT)
- For each GSM8K problem, 4 samples were generated; chosen = correct, rejected = incorrect (see the pairing sketch after this list)
- Epochs: 3
- β = 0.1, LR = 3e-6
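A hedged sketch of how such pairs could be assembled from the sampled generations (function and field names are assumptions, not the card's actual pipeline; `is_correct` mirrors the checker sketched in the evaluation section):

```python
def build_preference_pairs(question, samples, gold_answer, is_correct):
    """Pair correct completions (chosen) against incorrect ones (rejected)
    from the 4 samples generated for each GSM8K problem."""
    correct = [s for s in samples if is_correct(s, gold_answer)]
    incorrect = [s for s in samples if not is_correct(s, gold_answer)]

    pairs = [
        {"prompt": question, "chosen": c, "rejected": r}
        for c in correct
        for r in incorrect
    ]

    # Soft preference: when every sample is correct, prefer the shortest CoT.
    if correct and not incorrect and len(correct) > 1:
        pairs.append({
            "prompt": question,
            "chosen": min(correct, key=len),
            "rejected": max(correct, key=len),
        })
    return pairs
```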
Stage 3 – GRPO Reinforcement Learning
This model was further trained with GRPO on the GSM8K train split.
- Steps: 400
- Loss type: DAPO
- Rollouts per prompt: 4
- Gradient accumulation: 8
- Custom reward: format strictness + correctness (sketched after this list)
- vLLM-enabled rollouts with the TRL trainer
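A minimal sketch of a combined reward in the style TRL's GRPOTrainer expects (one scalar per completion, with dataset columns such as `answer` passed as keyword arguments); the exact weights and regexes used in training are not published, so the values below are assumptions:

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
# The prompt already ends with "<think>", so a well-formed completion
# closes the reasoning and then emits exactly one answer block.
FORMAT_RE = re.compile(r"^.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL)

def format_and_correctness_reward(completions, answer, **kwargs):
    """Return one reward per rollout: a small format bonus plus a larger
    bonus for a correct final answer (weights are illustrative)."""
    rewards = []
    for completion, gold in zip(completions, answer):
        reward = 0.0
        if FORMAT_RE.match(completion.strip()):
            reward += 0.2
        match = ANSWER_RE.search(completion)
        if match and match.group(1).strip().replace(",", "") == str(gold).strip().replace(",", ""):
            reward += 1.0
        rewards.append(reward)
    return rewards
```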
Prompt Template
```python
def prompt_input(question):
    # Preamble describing the <think>/<answer> format, then the user
    # question; generation continues right after the opening "<think>".
    prompt = f'''A conversation between User and Assistant. The User asks a question, and the Assistant solves it. The Assistant first thinks about the reasoning process in the mind and then provides the User with the answer. The reasoning process is enclosed within <think> </think> and answer is enclosed within <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>.
User: {question}
Assistant: <think>'''
    return prompt
```
Loading the model with Hugging Face Transformers
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "arubittu/MathReasoner-Mini-1.5b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```
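Putting the pieces together, a short inference example that assumes the two snippets above have been run and roughly mirrors the evaluation settings (`max_new_tokens` and the sample question are illustrative choices, not values from the card):

```python
question = "A pencil costs 3 dollars and a notebook costs 5 dollars. How much do 2 pencils and 3 notebooks cost?"

inputs = tokenizer(prompt_input(question), return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,   # illustrative; raise for longer reasoning chains
    do_sample=True,
    temperature=0.3,
    top_p=1.0,
)

# Decode only the newly generated tokens: the reasoning, then <answer> ... </answer>.
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(completion)
```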
Scope for Improvement
- Train with more diverse math data for better generalization. All three stages currently use the GSM8K train set, in full or as subsets; adding more diverse datasets should further improve the model and let it tackle harder problems. This path can be pursued in the future if more compute resources become available.
- Optimize the current training with curriculum learning via temperature and rollout scaling, especially in the RL phase, as done by PhysicsWallahAI/Aryabhata-1.0.
- Explore methods such as model merging and on-policy distillation from larger models.
Contact
If you find issues or want improvements, feel free to open an issue or discussion on the Hugging Face page.