Update README.md
README.md CHANGED
@@ -63,7 +63,6 @@ It is a fine-tune of **Qwen 2.5-VL-7B** using ~10k synthetic Doc-to-Reasoning-to
1. **SFT**: One-epoch supervised fine-tuning on synthetic reasoning traces generated from public PDFs (10K input/output pairs).
2. **RL (GRPO)**: RL phase using a layout-centric reward (5K difficult image examples).
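The layout-centric reward in step 2 is not published here, but a minimal sketch of what such a reward could look like is shown below, assuming it scores a completion's predicted layout boxes against reference boxes via intersection-over-union (the `iou`/`layout_reward` names and the `(x1, y1, x2, y2)` box format are illustrative assumptions, not the actual training code):

```python
# Hypothetical layout-centric reward sketch for the GRPO phase.
# Boxes are assumed to be (x1, y1, x2, y2) tuples in the same coordinate frame.

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def layout_reward(pred_boxes, ref_boxes):
    """Mean best-match IoU of each reference box against the predictions."""
    if not ref_boxes:
        # Nothing to recover: perfect if the model also predicted nothing.
        return 1.0 if not pred_boxes else 0.0
    scores = [max((iou(r, p) for p in pred_boxes), default=0.0)
              for r in ref_boxes]
    return sum(scores) / len(scores)
```

A scalar reward of this shape (one float per sampled completion) is the kind of signal GRPO-style RL consumes when ranking completions within a group.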
- **The pre-GRPO model loses 80% of the time vs. the post-GRPO model (see the win-rate matrix).**
## Example: