MathReasoner-Mini-1.5b
We recommend using this model for high-school-level math problems. It works better when the question is asked in English. We do not advise using it for other tasks, as this is an experimental release aimed at exploring the reasoning capabilities of small models.
Colab notebook for inference
Introduction
This is a reasoning model built on top of Qwen2.5-Math-1.5B-base and trained in three stages (SFT, DPO, and GRPO) to progressively improve mathematical reasoning with structured outputs on the GSM8K dataset, a benchmark of grade-school math problems.
Evaluation (GSM8K Pass@1, Zero-Shot)
| Model | Pass@1 Accuracy |
|---|---|
| Base Qwen2.5-Math-1.5B | 54% |
| After SFT | 67.5% |
| After SFT + DPO | 70% |
| After SFT + DPO + GRPO (MathReasoner-Mini-1.5b) | ~82.1% |
Evaluation was run on the GSM8K test split with temperature=0.3 and top_p=1.0.
MathReasoner's pass@8 accuracy is 94.1%, showing that further gains are still possible by scaling RL.
The accuracies above take the structured output format into account: reasoning must be enclosed within <think> tags and the numerical answer within <answer> tags.
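For reference, a minimal sketch of how such format-aware scoring can be done (the exact evaluation script is not published with this card, so the function names and regex below are illustrative assumptions):

```python
import re

def extract_answer(completion: str):
    """Pull the final answer out of <answer> ... </answer>; return None
    for badly formatted completions so they are scored as incorrect."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return None
    # Normalize commas/whitespace so "1,000" and " 1000 " compare equal.
    return match.group(1).strip().replace(",", "")

def is_correct(completion: str, gold_answer: str) -> bool:
    predicted = extract_answer(completion)
    return predicted is not None and predicted == gold_answer.strip().replace(",", "")
```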
Training Stages
Stage 1 – Supervised Fine-Tuning (SFT)
Checkpoint: arubittu/Qwen-2.5_1.5b_MATH_GSM8K_SFT10
- Dataset: curated GSM8K subset with self-verified generations
- Epochs: 10
- LR: 3e-6
- Batch size: 4
- Gradient accumulation: 4
- Only correct & well-formatted CoT samples were used, to minimize model entropy (see the filter sketch after this list)
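A minimal sketch of the kind of filter this implies (the actual curation code is not published with this card, so the regex and function name are assumptions):

```python
import re

# Assumed format: reasoning closed by </think>, then the answer in <answer> tags.
FORMAT_RE = re.compile(r"</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL)

def keep_for_sft(sample: str, gold_answer: str) -> bool:
    """Keep a generated CoT sample only if it is well formatted and its
    extracted answer matches the gold GSM8K answer."""
    match = FORMAT_RE.search(sample)
    if match is None:
        return False
    predicted = match.group(1).strip().replace(",", "")
    return predicted == gold_answer.strip().replace(",", "")
```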
Stage 2 – Direct Preference Optimization (DPO)
Checkpoint: arubittu/Qwen-2.5_1.5b_MATH_GSM8K_SFT10_DPO3
- Dataset: ~1,000 preference pairs
- Mostly hard pairs (correct vs incorrect)
- Some soft preferences (shorter correct CoT)
- For each GSM8K problem, 4 samples were generated; chosen = correct, rejected = incorrect (see the pairing sketch after this list)
- Epochs: 3
- β = 0.1, LR = 3e-6
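A hedged sketch of how such pairs could be assembled from the sampled generations (function and field names are assumptions, not the card's actual pipeline; `is_correct` mirrors the checker sketched in the evaluation section):

```python
def build_preference_pairs(question, samples, gold_answer, is_correct):
    """Pair correct completions (chosen) against incorrect ones (rejected)
    from the 4 samples generated for each GSM8K problem."""
    correct = [s for s in samples if is_correct(s, gold_answer)]
    incorrect = [s for s in samples if not is_correct(s, gold_answer)]

    pairs = [
        {"prompt": question, "chosen": c, "rejected": r}
        for c in correct
        for r in incorrect
    ]

    # Soft preference: when every sample is correct, prefer the shortest CoT.
    if correct and not incorrect and len(correct) > 1:
        pairs.append({
            "prompt": question,
            "chosen": min(correct, key=len),
            "rejected": max(correct, key=len),
        })
    return pairs
```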
Stage 3 – GRPO Reinforcement Learning
This model was further trained with GRPO on the GSM8K train split.
- Steps: 400
- Loss type: DAPO
- Rollouts per prompt: 4
- Gradient accumulation: 8
- Custom reward: format strictness + correctness (sketched after this list)
- vLLM-enabled rollouts with the TRL trainer
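A minimal sketch of a combined reward in the style TRL's GRPOTrainer expects (one scalar per completion, with dataset columns such as `answer` passed as keyword arguments); the exact weights and regexes used in training are not published, so the values below are assumptions:

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
# The prompt already ends with "<think>", so a well-formed completion
# closes the reasoning and then emits exactly one answer block.
FORMAT_RE = re.compile(r"^.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL)

def format_and_correctness_reward(completions, answer, **kwargs):
    """Return one reward per rollout: a small format bonus plus a larger
    bonus for a correct final answer (weights are illustrative)."""
    rewards = []
    for completion, gold in zip(completions, answer):
        reward = 0.0
        if FORMAT_RE.match(completion.strip()):
            reward += 0.2
        match = ANSWER_RE.search(completion)
        if match and match.group(1).strip().replace(",", "") == str(gold).strip().replace(",", ""):
            reward += 1.0
        rewards.append(reward)
    return rewards
```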
Prompt Template
```python
def prompt_input(question):
    # Preamble describing the <think>/<answer> format, then the user
    # question; generation continues right after the opening "<think>".
    prompt = f'''A conversation between User and Assistant. The User asks a question, and the Assistant solves it. The Assistant first thinks about the reasoning process in the mind and then provides the User with the answer. The reasoning process is enclosed within <think> </think> and answer is enclosed within <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>.
User: {question}
Assistant: <think>'''
    return prompt
```
Loading the model with Hugging Face Transformers
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "arubittu/MathReasoner-Mini-1.5b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```
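Putting the pieces together, a short inference example that assumes the two snippets above have been run and roughly mirrors the evaluation settings (`max_new_tokens` and the sample question are illustrative choices, not values from the card):

```python
question = "A pencil costs 3 dollars and a notebook costs 5 dollars. How much do 2 pencils and 3 notebooks cost?"

inputs = tokenizer(prompt_input(question), return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,   # illustrative; raise for longer reasoning chains
    do_sample=True,
    temperature=0.3,
    top_p=1.0,
)

# Decode only the newly generated tokens: the reasoning, then <answer> ... </answer>.
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(completion)
```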
Scope for Improvement
- Train with more diverse math data for better generalization. All three stages currently use the GSM8K train set, in full or as subsets; adding more diverse datasets should further improve the model and let it tackle harder problems. This path can be pursued in the future if more compute resources become available.
- Optimize the current training with curriculum learning via temperature and rollout scaling, especially in the RL phase, as done by PhysicsWallahAI/Aryabhata-1.0.
- Explore methods such as model merging and on-policy distillation from larger models.
Contact
If you find issues or want improvements, feel free to open an issue or discussion on the Hugging Face page.