Update README.md
README.md
CHANGED
@@ -6,7 +6,7 @@ tags: []
 # Model Card for Model ID
 
 <!-- Provide a quick summary of what the model is/does. -->
-**PPO-C** (PPO with Calibrated Reward
+**PPO-C** (PPO with Calibrated Reward Calculation) is an RLHF algorithm to mitigate verbalized overconfidence in RLHF-trained Large Language Models.
 PPO-C adjusts standard reward model scores during PPO training. It maintains a running average of past reward scores as a dynamic threshold to
 classify responses, and adjusts the reward scores based on model expressed verbalized confidence.
 Please refer to our preprint ([Taming Overconfidence in LLMs: Reward Calibration in RLHF](https://arxiv.org/abs/2410.09724)) and [repo](https://github.com/SeanLeng1/Reward-Calibration) for more details.
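
The summary in this hunk describes the mechanism only in words, so a rough sketch may help readers picture it. The Python below is an illustration under stated assumptions, not the authors' implementation: the class name `CalibratedRewardSketch`, the exponential-moving-average update, the `alpha` and `scale` parameters, and the exact adjustment formula are all hypothetical. The card only says that a running average of past rewards acts as a dynamic threshold and that rewards are adjusted using verbalized confidence; see the linked preprint and repo for the actual rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CalibratedRewardSketch:
    """Hypothetical sketch of a confidence-aware reward adjustment.

    Assumptions (not taken from the model card): the running average is an
    exponential moving average, and the adjustment is linear in
    (confidence - 0.5).
    """

    alpha: float = 0.1  # assumed weight of the newest reward in the running average
    scale: float = 1.0  # assumed strength of the confidence adjustment
    threshold: Optional[float] = None  # running average of past reward scores

    def adjust(self, reward: float, confidence: float) -> float:
        """Return an adjusted reward; `confidence` is expected in [0, 1]."""
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        if self.threshold is None:
            # No history yet: seed the threshold with the first reward.
            self.threshold = reward

        # Classify the response against the dynamic threshold.
        above = reward >= self.threshold
        sign = 1.0 if above else -1.0

        # High confidence is rewarded on above-threshold responses and
        # penalized on below-threshold ones; low confidence does the opposite.
        adjusted = reward + sign * self.scale * (confidence - 0.5)

        # Update the running average only after classifying this response.
        self.threshold = (1.0 - self.alpha) * self.threshold + self.alpha * reward
        return adjusted


if __name__ == "__main__":
    calibrator = CalibratedRewardSketch(alpha=0.5, scale=1.0)
    for reward, confidence in [(1.0, 0.9), (0.0, 0.9), (0.0, 0.1)]:
        print(reward, confidence, calibrator.adjust(reward, confidence))
```

In this sketch a confident answer that scores below the running average loses reward, while an unconfident one in the same position gains some back, which is one plausible reading of "adjusts the reward scores based on model expressed verbalized confidence".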