This model was converted to GGUF format from [`internlm/OREAL-DeepSeek-R1-Distill-Qwen-7B`](https://huggingface.co/internlm/OREAL-DeepSeek-R1-Distill-Qwen-7B) using llama.cpp via ggml.ai's [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space.

Refer to the [original model card](https://huggingface.co/internlm/OREAL-DeepSeek-R1-Distill-Qwen-7B) for more details on the model.

---

## Introduction

We introduce OREAL-7B and OREAL-32B, a mathematical reasoning model series trained using Outcome REwArd-based reinforcement Learning (OREAL), a novel RL framework designed for tasks where only binary outcome rewards are available.

With OREAL, a 7B model achieves 94.0 pass@1 accuracy on MATH-500, matching the performance of previous 32B models. OREAL-32B further surpasses previous distillation-trained 32B models, reaching 95.0 pass@1 accuracy on MATH-500.

Our method leverages best-of-N (BoN) sampling for behavior cloning and reshapes the rewards of negative samples to ensure gradient consistency. To address the challenge of sparse rewards in long chain-of-thought reasoning, we also incorporate an on-policy token-level reward model that identifies key tokens in reasoning trajectories for importance sampling. For more details, please refer to our paper.

---
## Use with llama.cpp
Install llama.cpp through brew (works on Mac and Linux)
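For example, a minimal sketch of the install-and-run flow is shown below. The `--hf-repo` and `--hf-file` values are placeholders (this card does not list the GGUF repo ID or quantized filename), so substitute the actual values from this repository's file list:

```bash
# Install llama.cpp via Homebrew (macOS and Linux)
brew install llama.cpp

# Run inference from the CLI.
# NOTE: the repo ID and GGUF filename below are hypothetical placeholders --
# replace them with this repo's actual ID and the quantized file it contains.
llama-cli --hf-repo <your-username>/OREAL-DeepSeek-R1-Distill-Qwen-7B-GGUF \
  --hf-file oreal-deepseek-r1-distill-qwen-7b-q8_0.gguf \
  -p "Prove that the sum of two even integers is even."

# Or start llama.cpp's OpenAI-compatible HTTP server (port 8080 by default).
llama-server --hf-repo <your-username>/OREAL-DeepSeek-R1-Distill-Qwen-7B-GGUF \
  --hf-file oreal-deepseek-r1-distill-qwen-7b-q8_0.gguf \
  -c 2048
```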
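Once the server is running, any OpenAI-style client can query it. A small example with `curl`, assuming the default port of 8080:

```bash
# Send a chat request to llama-server's OpenAI-compatible endpoint.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "What is 17 * 23? Reason step by step."}
        ]
      }'
```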