Improve model card: Add pipeline tag, library, paper & code links, introduction, and installation (#1)
Co-authored-by: Niels Rogge <[email protected]>
README.md

---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
---

# ExGRPO-Llama3.1-8B-Zero: Learning to Reason from Experience

This repository hosts the `ExGRPO-Llama3.1-8B-Zero` model, a component of the **ExGRPO: Learning to Reason from Experience** framework. This work was presented in the paper:

[**ExGRPO: Learning to Reason from Experience**](https://huggingface.co/papers/2510.02245)

Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. Standard on-policy training, however, uses each rollout experience for a single update and then discards it, which is computationally inefficient and can be unstable. ExGRPO (Experiential Group Relative Policy Optimization) investigates what makes a reasoning experience valuable and proposes a framework that organizes and prioritizes such experiences, using a mixed-policy objective to balance exploration with experience exploitation. This yields consistent improvements in reasoning performance on mathematical and general benchmarks.
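
As a rough mental model of that exploration/exploitation balance, the sketch below shows how a training batch might mix fresh rollouts with replayed experiences. This is our illustration, not the paper's algorithm; `replay_buffer`, `policy.rollout`, and `mix_ratio` are hypothetical names.

```python
# Sketch only: experiential batch mixing (not the official ExGRPO code).
import random

def build_batch(prompts, replay_buffer, policy, mix_ratio=0.5, batch_size=8):
    """Mix fresh on-policy rollouts with replayed high-value experiences.

    `replay_buffer` is assumed to hold past trajectories already ranked by
    value (highest first); `policy.rollout` draws a fresh on-policy sample.
    """
    n_replay = int(batch_size * mix_ratio)
    replayed = replay_buffer[:n_replay]  # exploit: reuse valuable experiences
    fresh = [policy.rollout(p)           # explore: new on-policy rollouts
             for p in random.sample(prompts, batch_size - n_replay)]
    return fresh + replayed
```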

For the official code and further details on the ExGRPO framework, please visit the project's GitHub repository:
[**GitHub Repository**](https://github.com/ElliottYan/LUFFY/tree/main/ExGRPO)

<div align="center">
<img src="https://github.com/ElliottYan/LUFFY/raw/main/ExGRPO/figures/exgrpo_intro.png" alt="overview" style="width: 88%; height: auto;">
</div>

## Key Highlights

ExGRPO introduces significant advancements in RLVR for reasoning tasks:

- **Experience Value Modeling**: Introduces online proxy metrics (rollout correctness and trajectory entropy) for quantifying the value of RLVR experience (illustrated in the sketch after this list).
- **ExGRPO Framework**: Built on top of GRPO, ExGRPO adds a systematic experience management mechanism and an experience optimization objective to maximize the benefit of past explorations.
- **Generalization and Stability**: Demonstrates broad applicability across different backbone models and mitigates the training collapse of on-policy RLVR in challenging scenarios.
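
To make the first bullet concrete, here is a minimal, illustrative computation of the two proxy metrics. This is not the official implementation; the paper's exact estimators and aggregation may differ.

```python
# Illustrative proxy metrics for experience value (sketch, not official code).

def rollout_correctness(rewards):
    """Fraction of a prompt's group of rollouts that were verified correct."""
    return sum(rewards) / len(rewards)

def trajectory_entropy(token_logprobs):
    """Entropy proxy: negative mean log-probability of the sampled tokens."""
    return -sum(token_logprobs) / len(token_logprobs)

rewards = [1, 0, 1, 1, 0, 0, 1, 0]          # 8 rollouts, 4 verified correct
logprobs = [-0.2, -0.9, -0.4, -0.1, -0.6]   # per-token log-probs of one trajectory
print(rollout_correctness(rewards))         # 0.5
print(trajectory_entropy(logprobs))         # 0.44
```

Intuitively, prompts with mid-range correctness and low-entropy trajectories are the most valuable experiences to replay.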

## Getting Started & Usage

This model is compatible with the Hugging Face `transformers` library, and a default sample-usage snippet is displayed automatically on its Hugging Face Hub page. A minimal text-generation example is sketched below.
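
The snippet is untested and the prompt format is illustrative; check the GitHub repository for the exact template used during training.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "rzzhan/ExGRPO-Llama3.1-8B-Zero"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumes a bf16-capable GPU
    device_map="auto",
)

prompt = "Solve step by step: if 3x + 5 = 20, what is x?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```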

To set up the environment and explore the project, follow the installation instructions from the [GitHub repository](https://github.com/ElliottYan/LUFFY/tree/main/ExGRPO#getting-started):

### Installation

You can install dependencies by running the following commands:

```bash
conda create -n exgrpo python=3.10
conda activate exgrpo
cd exgrpo
pip install -r requirements.txt
pip install -e .
cd verl
pip install -e .
```
> **Note**: If you encounter issues caused by the `pyairports` library, please refer to this hotfix [solution](https://github.com/ElliottYan/LUFFY?tab=readme-ov-file#update-98).

For `flash-attn`, we use the `v2.7.4.post1` release and recommend installing it from the pre-built wheel; adjust the wheel name to match your Python, CUDA, and PyTorch versions.
```bash
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.7.4.post1+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```
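
A quick sanity check that the wheel matches your environment (the wheel above expects Python 3.10, CUDA 12, and PyTorch 2.4):

```python
import torch
import flash_attn

print(torch.__version__, torch.version.cuda)  # expect 2.4.x and a 12.x CUDA
print(flash_attn.__version__)                 # expect 2.7.4.post1
```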

### Data Preparation and Training

For data preparation and training instructions, please refer to the [Usage section](https://github.com/ElliottYan/LUFFY/tree/main/ExGRPO#usage) in the GitHub repository.

### Evaluation

The GitHub repository also provides scripts for evaluating the model on various benchmarks; in particular, the `generate_vllm.py` script handles generation and evaluation, as detailed in the [Evaluation section](https://github.com/ElliottYan/LUFFY/tree/main/ExGRPO#evaluation) of the GitHub repository. A generic vLLM snippet is sketched below.
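
The following is a plain vLLM generation sketch, not the repository's `generate_vllm.py`; the sampling parameters here are placeholders:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="rzzhan/ExGRPO-Llama3.1-8B-Zero")
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=2048)
outputs = llm.generate(
    ["Prove that the sum of two even integers is even."], params
)
print(outputs[0].outputs[0].text)
```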

## Released Models

The ExGRPO framework includes several released models. This table (copied from the GitHub README) provides an overview:

| **Model** | **Hugging Face** | **Base Model** |
|---|---|---|
| ExGRPO-Qwen2.5-Math-7B-Zero | [rzzhan/ExGRPO-Qwen2.5-Math-7B-Zero](https://huggingface.co/rzzhan/ExGRPO-Qwen2.5-Math-7B-Zero) | Qwen2.5-Math-7B |
| ExGRPO-LUFFY-7B-Continual | [rzzhan/ExGRPO-LUFFY-7B-Continual](https://huggingface.co/rzzhan/ExGRPO-LUFFY-7B-Continual) | LUFFY-Qwen-Math-7B-Zero |
| ExGRPO-Qwen2.5-7B-Instruct | [rzzhan/ExGRPO-Qwen2.5-7B-Instruct](https://huggingface.co/rzzhan/ExGRPO-Qwen2.5-7B-Instruct) | Qwen2.5-7B-Instruct |
| ExGRPO-Qwen2.5-Math-1.5B-Zero | [rzzhan/ExGRPO-Qwen2.5-Math-1.5B-Zero](https://huggingface.co/rzzhan/ExGRPO-Qwen2.5-Math-1.5B-Zero) | Qwen2.5-Math-1.5B |
| ExGRPO-Llama3.1-8B-Zero | [rzzhan/ExGRPO-Llama3.1-8B-Zero](https://huggingface.co/rzzhan/ExGRPO-Llama3.1-8B-Zero) | Llama3.1-8B |
| ExGRPO-Llama3.1-8B-Instruct | [rzzhan/ExGRPO-Llama3.1-8B-Instruct](https://huggingface.co/rzzhan/ExGRPO-Llama3.1-8B-Instruct) | Llama3.1-8B-Instruct |

## Citation

If you find our model, data, or evaluation code useful, please cite our paper:

```bibtex
@article{zhan2025exgrpo,
  title   = {ExGRPO: Learning to Reason from Experience},
  author  = {Runzhe Zhan and Yafu Li and Zhi Wang and Xiaoye Qu and Dongrui Liu and Jing Shao and Derek F. Wong and Yu Cheng},
  year    = {2025},
  journal = {ArXiv preprint},
  volume  = {2510.02245},
  url     = {https://arxiv.org/abs/2510.02245},
}
```