Uppaal/gpt-j-ProFS-toxicity

Text Generation

activation-steering

activation-editing

Uppaal committed 22 days ago

Commit 6f02751 · verified · 1 Parent(s): 8e6c3db

Update README.md

Files changed (1)

README.md +4 -3

@@ -36,12 +36,13 @@ base_model:
 # ProFS Editing for Safety
-This model accompanies the paper [Model Editing as a Robust and Denoised Variant of DPO: A Case Study on Toxicity](https://arxiv.org/abs/2405.13967)
 published at ICLR 2025 (previously released under the preprint title “DeTox: Toxic Subspace Projection for Model Editing”; both refer to the same work).
-ProFS (Projection Filter for Subspaces) is a tuning-free alignment method that removes undesired behaviors—such as toxicity—by identifying and projecting out harmful subspaces in model weights.
 **Key Features:**
 - Training-free & plug-and-play: edits weights directly, no gradient steps or architectural changes needed.

 # ProFS Editing for Safety
+This model is an edited version of [`EleutherAI/gpt-j-6b`](https://huggingface.co/EleutherAI/gpt-j-6b).
+Editing is applied through ProFS, to reduce toxicity.
+ProFS (Projection Filter for Subspaces) is a tuning-free alignment method that removes undesired behaviors by identifying and projecting out harmful subspaces in model weights.
+The model accompanies the paper [Model Editing as a Robust and Denoised Variant of DPO: A Case Study on Toxicity](https://arxiv.org/abs/2405.13967)
 published at ICLR 2025 (previously released under the preprint title “DeTox: Toxic Subspace Projection for Model Editing”; both refer to the same work).
 **Key Features:**
 - Training-free & plug-and-play: edits weights directly, no gradient steps or architectural changes needed.
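
The README describes ProFS as "identifying and projecting out harmful subspaces in model weights". As a rough illustration of the projection step only, the sketch below removes a given subspace from the output space of a weight matrix using NumPy. It is a generic linear-algebra example, not the authors' implementation: the function name `project_out_subspace`, the random `directions`, and the matrix shapes are all assumptions made for illustration, and how ProFS actually identifies the subspace is not shown here.

```python
import numpy as np


def project_out_subspace(weight: np.ndarray, directions: np.ndarray) -> np.ndarray:
    """Remove the span of `directions` from the output space of `weight`.

    weight:     (d_out, d_in) matrix.
    directions: (k, d_out) array whose rows span the subspace to remove
                (hypothetical input; the rows need not be orthonormal).
    """
    # Orthonormalize the directions so that Q @ Q.T is a true projector.
    q, _ = np.linalg.qr(directions.T)  # q has shape (d_out, k)
    # Apply (I - Q Q^T) to the weight without forming the d_out x d_out projector.
    return weight - q @ (q.T @ weight)


# Toy data standing in for a real weight matrix and an identified subspace.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 5))
D = rng.normal(size=(2, 8))

W_edited = project_out_subspace(W, D)

# After the edit, the removed directions are (numerically) orthogonal to every output.
residual = np.abs(D @ W_edited).max()
```

Because the edit is a single matrix operation on existing weights, it needs no gradient steps, which is consistent with the "training-free" property listed above. Applying the function twice gives the same result as applying it once, as expected for a projection.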