HINT-lab/llama3-8b-final-ppo-c-v0.3


teapot123 committed on Oct 17, 2024

Commit 3d5b45a (verified) · 1 Parent(s): 5cc542c

Update README.md

Files changed (1)

README.md +1 -1


@@ -6,7 +6,7 @@ tags: []
 # Model Card for Model ID
 <!-- Provide a quick summary of what the model is/does. -->
-**PPO-C** (PPO with Calibrated Reward Score) is an RLHF algorithm to mitigate verbalized overconfidence in RLHF-trained Large Language Models.
+**PPO-C** (PPO with Calibrated Reward Calculation) is an RLHF algorithm to mitigate verbalized overconfidence in RLHF-trained Large Language Models.
 PPO-C adjusts standard reward model scores during PPO training. It maintains a running average of past reward scores as a dynamic threshold to
 classify responses, and adjusts the reward scores based on model expressed verbalized confidence.
 Please refer to our preprint ([Taming Overconfidence in LLMs: Reward Calibration in RLHF](https://arxiv.org/abs/2410.09724)) and [repo](https://github.com/SeanLeng1/Reward-Calibration) for more details.
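
The README text in this diff describes the mechanism only in words, so a rough sketch may help readers picture it. The code below is an illustration under stated assumptions, not the authors' implementation: the class name `CalibratedRewardAdjuster`, the parameters `alpha` and `weight`, the exponential form of the running average, and the exact adjustment rule are all hypothetical. It assumes verbalized confidence has already been parsed from each response and normalized to [0, 1]. See the linked preprint and repo for the actual method.

```python
from dataclasses import dataclass


@dataclass
class CalibratedRewardAdjuster:
    """Illustrative sketch only; names and the update rule are assumptions."""

    alpha: float = 0.1         # hypothetical smoothing factor for the running average
    weight: float = 0.5        # hypothetical scale of the confidence-based adjustment
    running_mean: float = 0.0  # dynamic threshold, updated once per batch
    initialized: bool = False

    def adjust(self, rewards, confidences):
        """Return adjusted rewards for one batch.

        rewards: raw reward-model scores.
        confidences: verbalized confidence per response, normalized to [0, 1].
        """
        if not rewards or len(rewards) != len(confidences):
            raise ValueError("rewards and confidences must be non-empty and equal length")

        # Update the running average of past reward scores (the dynamic threshold).
        batch_mean = sum(rewards) / len(rewards)
        if not self.initialized:
            self.running_mean = batch_mean
            self.initialized = True
        else:
            self.running_mean = (
                (1 - self.alpha) * self.running_mean + self.alpha * batch_mean
            )

        # Responses above the threshold gain reward for high stated confidence;
        # responses below it lose reward for high stated confidence.
        threshold = self.running_mean
        return [
            r + self.weight * (r - threshold) * (c - 0.5)
            for r, c in zip(rewards, confidences)
        ]
```

Under this assumed rule, a response with confidence 0.5 keeps its raw score, so the adjustment only acts when the model commits to being confident or unconfident.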