ExGRPO-Qwen2.5-Math-1.5B-Zero: Learning to Reason from Experience
ExGRPO-Qwen2.5-Math-1.5B-Zero is a checkpoint from the ExGRPO framework, presented in the paper ExGRPO: Learning to Reason from Experience. It uses Qwen2.5-Math-1.5B as its backbone model.
Introduction
Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. However, standard on-policy training discards rollout experiences after a single update, leading to computational inefficiency and instability. While prior work on RL has highlighted the benefits of reusing past experience, the role of experience characteristics in shaping learning dynamics of large reasoning models remains underexplored. In this paper, we are the first to investigate what makes a reasoning experience valuable and identify rollout correctness and entropy as effective indicators of experience value. Based on these insights, we propose ExGRPO (Experiential Group Relative Policy Optimization), a framework that organizes and prioritizes valuable experiences, and employs a mixed-policy objective to balance exploration with experience exploitation. Experiments on five backbone models (1.5B-8B parameters) show that ExGRPO consistently improves reasoning performance on mathematical/general benchmarks, with an average gain of +3.5/7.6 points over on-policy RLVR. Moreover, ExGRPO stabilizes training on both stronger and weaker models where on-policy methods fail. These results highlight principled experience management as a key ingredient for efficient and scalable RLVR.
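The exact ExGRPO objective is defined in the paper and implemented in the official repository. Purely as an illustration of the underlying idea, the sketch below shows a generic GRPO-style clipped surrogate in which replayed experiences reuse the log-probabilities stored when they were collected, so the policy ratio doubles as an importance-sampling correction; all names and shapes here are assumptions, not the official API.

```python
import torch

def clipped_policy_loss(new_logprobs, behavior_logprobs, advantages, mask, clip_eps=0.2):
    """Generic GRPO/PPO-style clipped surrogate (illustrative, not the official ExGRPO loss).

    new_logprobs, behavior_logprobs: (batch, seq) token log-probs under the current policy
        and under the policy that generated each trajectory (the current rollout policy for
        fresh on-policy data, or the stored snapshot for replayed experiences).
    advantages: (batch,) group-normalized rewards, e.g. (reward - group mean) / group std.
    mask: (batch, seq) with 1 for response tokens and 0 for padding.
    """
    ratio = torch.exp(new_logprobs - behavior_logprobs)   # per-token importance ratio
    adv = advantages.unsqueeze(-1)                         # broadcast over the sequence
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    return -(surrogate * mask).sum() / mask.sum()
```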
Key Highlights from the paper:
- Experience Value Modeling: Introduces two online proxy metrics, rollout correctness and trajectory entropy, for quantifying the value of RLVR experiences (see the illustrative sketch after this list).
- ExGRPO Framework: Built on top of GRPO, ExGRPO introduces a systematic experience management mechanism and an experience optimization objective to maximize the benefit of past explorations.
- Generalization and Stability: Demonstrates broad applicability across different backbone models and mitigates training collapse of on-policy RLVR in challenging scenarios.
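The precise selection and replay rules are described in the paper; the snippet below is only a minimal sketch of how the two proxy metrics could be computed for a group of verifiable rollouts (the `Rollout` container and function names are illustrative, not part of the released code):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Rollout:
    token_logprobs: List[float]  # log-probs of the sampled tokens in one trajectory
    is_correct: bool             # verifiable reward: did the trajectory reach the right answer?

def rollout_correctness(group: List[Rollout]) -> float:
    """Fraction of a question's rollouts that receive a positive verifiable reward."""
    return sum(r.is_correct for r in group) / len(group)

def trajectory_entropy(r: Rollout) -> float:
    """Proxy for trajectory entropy: mean negative log-probability of the sampled tokens."""
    return -sum(r.token_logprobs) / len(r.token_logprobs)

# A replay buffer can then be organized by these two signals, e.g. grouping questions by
# rollout correctness and preferring confident (low-entropy) correct trajectories for reuse.
```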
For further details on the ExGRPO framework, training procedures, and comprehensive evaluation results, please refer to the official GitHub repository.
Sample Usage
You can use this model with the Hugging Face transformers library. Below is an example demonstrating how to load the model and generate a response for a mathematical reasoning problem, utilizing its specific chat template for structured output.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model and tokenizer from the Hugging Face Hub
model_name = "rzzhan/ExGRPO-Qwen2.5-Math-1.5B-Zero"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # Use torch.float16 if bfloat16 is not supported on your GPU
    device_map="auto"
)

# Define the conversation using the model's chat template structure.
# The template guides the model to follow a reasoning process (Thought -> Solution).
messages = [
    {"role": "user", "content": "What is 15 plus 7, and then subtract 3?"},
]

# Apply the chat template to format the prompt as expected by the model
text_input = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize the input and move it to the model's device
model_inputs = tokenizer([text_input], return_tensors="pt").to(model.device)

# Generate the output
generated_ids = model.generate(
    **model_inputs,  # pass input_ids together with the attention mask
    max_new_tokens=512,  # allow room for the full reasoning trace; adjust as needed
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    eos_token_id=tokenizer.eos_token_id  # Ensure generation stops at the EOS token
)

# Decode and print the generated text (includes the prompt)
decoded_output = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(decoded_output)
```
Expected output structure (content may vary due to do_sample=True):
```
User: This is the problem:
What is 15 plus 7, and then subtract 3?
Assistant: <think>
Here's my thought process to solve the problem:
1. **Analyze the problem**: The problem asks for a sequence of arithmetic operations: addition and subtraction.
2. **Break down the problem**:
* First operation: "15 plus 7"
* Second operation: "then subtract 3"
3. **Perform the first operation**: 15 + 7
* 15 + 7 = 22
4. **Perform the second operation**: 22 - 3
* 22 - 3 = 19
5. **Final Answer**: The result is 19.
</think>
Solution:
The answer is 19.
```
(Note: the actual output will vary because of sampling, and the exact split between the thought process and the final Solution depends on the chat template and on any stop tokens used for multi-stage generation.)
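If you want to inspect the thought process separately from the final solution, one option is to stop generation at the closing `</think>` tag and then let the model continue. A minimal sketch, assuming a recent `transformers` version that supports the `stop_strings` argument of `generate` and that the model continues with the solution after the think block, as in the expected output above:

```python
# Stage 1: generate the reasoning only, stopping at the closing think tag.
thought_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    stop_strings=["</think>"],  # requires passing the tokenizer to generate()
    tokenizer=tokenizer,
)
print(tokenizer.decode(
    thought_ids[0][model_inputs.input_ids.shape[1]:], skip_special_tokens=True
))

# Stage 2: continue from the prompt plus the finished thought to obtain the solution.
solution_ids = model.generate(thought_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(
    solution_ids[0][thought_ids.shape[1]:], skip_special_tokens=True
))
```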
Evaluation Results
The paper presents a comprehensive evaluation across mathematical and general reasoning benchmarks. The key result figures, available in the official GitHub repository, cover:
- Zero RLVR on Qwen2.5-Math-7B & Continual RLVR on LUFFY
- Zero RLVR on Llama3.1-8B (Base, Instruct), Qwen2.5-Math-1.5B Base, and Qwen2.5-7B Instruct, including full results of the model extension
Citation
If you find our model, data, or evaluation code useful, please kindly cite our paper:
```bibtex
@article{zhan2025exgrpo,
  title={ExGRPO: Learning to Reason from Experience},
  author={Runzhe Zhan and Yafu Li and Zhi Wang and Xiaoye Qu and Dongrui Liu and Jing Shao and Derek F. Wong and Yu Cheng},
  year={2025},
  journal={ArXiv preprint},
  volume={2510.02245},
  url={https://arxiv.org/abs/2510.02245},
}
```