Commit 76a0618 (verified) · committed by prasoonv · parent: c7be539

Update model name

Signed-off-by: Prasoon Varshney <[email protected]>

Files changed (1): README.md (+9 -9)
README.md CHANGED
@@ -17,19 +17,19 @@ pipeline_tag: text-classification
 The use of this model is governed by the [Llama 2 Community License Agreement](https://ai.meta.com/llama/license/).
 
 ## Model Details
-Aegis-AI-Content-Safety-LlamaGuard-LLM-Permissive-1.0 is a LLM content safety model. It is a parameter efficient instruction tuned version of [Llama Guard](https://huggingface.co/meta-llama/LlamaGuard-7b) based on [Llama2-7B](https://arxiv.org/abs/2307.09288) trained on Nvidia's content safety dataset [Aegis Content Safety Dataset](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0) covering Nvidia's broad taxonomy of 13 critical safety risk categories.
+Llama Nemotron Safety Guard Permissive V1, formerly known as Aegis-AI-Content-Safety-LlamaGuard-LLM-Permissive-1.0, is an LLM content safety model. It is a parameter-efficient, instruction-tuned version of [Llama Guard](https://huggingface.co/meta-llama/LlamaGuard-7b), based on [Llama2-7B](https://arxiv.org/abs/2307.09288), trained on Nvidia's [Aegis Content Safety Dataset](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0), which covers Nvidia's broad taxonomy of 13 critical safety risk categories.
 
-Paper Details: [Aegis Content Moderation](https://arxiv.org/pdf/2404.05993.pdf#page=10.63)
+Paper Details: [Aegis 1.0: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts](https://arxiv.org/pdf/2404.05993.pdf#page=10.63)
 
 ### Model Description
-The Aegis-AI-Content-Safety-LlamaGuard-LLM-Permissive-1.0 model involves the following stages:
+The Llama-2-Nemotron-Safety-Guard-Permissive-7B-v1 model involves the following stages:
 
 1. A system instruction provides the safety taxonomy and a safety policy with inclusions and exclusions.
 2. The system prompt instructs the LLM to moderate a user prompt, a partial dialog, or a full dialog.
 3. The LLM response is a string that is either "safe" or "unsafe". If the generated string is "unsafe", the LLM outputs the category ID of the violation on a new line, based on the policy in the system prompt (see the inference sketch after this hunk).
 4. Novel safety risk categories and policies can be provided in the instruction, and the model will categorize using the novel taxonomy and policy.
 5. The safety taxonomy and policy used to train the models contain 13 critically unsafe risk categories, a safe category, and a "needs caution" category.
-6. Internally annotated dataset called Aegis-AI-Content-Safety-Dataset-1.0 of approximately 11,000 prompts and responses are used to instruction tune the model. Annotations are at dialog level not per turn.
+6. An internally annotated dataset called [Nemotron Content Safety Dataset V1](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0), comprising approximately 11,000 prompts and responses, is used to instruction-tune the model. Annotations are at the dialog level, not per turn.
 We have since collected 30,000 annotations in total on a further expanded taxonomy, and future versions of the models will be trained on the full set.
 7. The model is instruction-tuned with the safety instruction, with the LLM behaving as a classifier in this setting.
 PLEASE NOTE: The model has only been trained to perform prompt classification, since annotations were not available at the turn level. If you wish to use the model for response classification, use the template provided below.
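A minimal inference sketch of the flow in steps 1-7, assuming the checkpoint is a PEFT adapter applied to the LlamaGuard-7b base; the adapter repository ID, generation settings, and output parsing are illustrative assumptions, and the system prompt carrying the full taxonomy is elided here.

```python
# Minimal sketch, not the official inference path: load a PEFT adapter
# onto the Llama Guard base and parse the "safe" / "unsafe\nO1"-style
# output format described in step 3. Repo IDs below are assumptions.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "meta-llama/LlamaGuard-7b"  # base model per the model card
ADAPTER_ID = "nvidia/Aegis-AI-Content-Safety-LlamaGuard-Permissive-1.0"  # assumed adapter repo

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
model = AutoModelForCausalLM.from_pretrained(
    BASE_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, ADAPTER_ID)

def moderate(full_prompt: str) -> tuple[str, list[str]]:
    """Run the safety classifier; return ("safe"|"unsafe", violated category IDs)."""
    inputs = tokenizer(full_prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=32,
                            pad_token_id=tokenizer.eos_token_id)
    # Decode only the newly generated tokens.
    text = tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:],
                            skip_special_tokens=True).strip()
    lines = text.splitlines()
    verdict = lines[0].strip()
    # On "unsafe", the violated category IDs (e.g. "O1") follow on the next line.
    categories = (
        [c.strip() for c in lines[1].split(",")]
        if verdict == "unsafe" and len(lines) > 1 else []
    )
    return verdict, categories
```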
@@ -155,7 +155,7 @@ O6
 ```
 
 
-The difference between this Llama Guard Permissive and the [Llama Guard Defensive](https://huggingface.co/nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0) model is that, the permissive model is trained on the Aegis data where ***Needs Caution*** category is mapped to ***Safe*** whereas the for the defensive model, the ***Needs Caution*** category is mapped to ***Unsafe***.
+The difference between this Llama Nemotron Safety Guard Permissive model and the [Llama Nemotron Safety Guard Defensive](https://huggingface.co/nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0) model is that the permissive model is trained on the Aegis data with the ***Needs Caution*** category mapped to ***Safe***, whereas for the defensive model the ***Needs Caution*** category is mapped to ***Unsafe*** (a sketch of this mapping follows this hunk).
 
 - **Developed by:** Shaona Ghosh, Nvidia
 - **Model type:** Instruction-tuned Llama2-7B
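A minimal sketch of that ***Needs Caution*** mapping; the literal label strings are assumptions based on the category names, not verified dataset field values.

```python
# Sketch of the permissive vs. defensive label mapping described above.
# Label strings are assumptions; check the dataset card for exact values.
def map_needs_caution(label: str, defensive: bool = False) -> str:
    """Collapse the ternary annotation into the binary training label."""
    if label == "Needs Caution":
        return "Unsafe" if defensive else "Safe"  # permissive maps it to Safe
    return label

assert map_needs_caution("Needs Caution") == "Safe"
assert map_needs_caution("Needs Caution", defensive=True) == "Unsafe"
```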
@@ -169,7 +169,7 @@ Ethical use: Technology can have a profound impact on people and the world, and
 
 ### Direct Use
 
-- The Aegis-AI-Content-Safety-LlamaGuard-LLM-Permissive-1.0 model is for users who wants to safeguard or evaluate a general purpose LLM's generated content
+- The Llama-2-Nemotron-Safety-Guard-Permissive-7B-v1 model is for users who want to safeguard or evaluate a general-purpose LLM's generated content.
 
 Model and dataset restrictions:
 
@@ -371,7 +371,7 @@ The inference code for this model is available through the NeMo Curator GitHub r
 ## Training Details
 
 ### Training Data
-The model has been trained on Nvidia's [Aegis Content Safety Dataset](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0)
+The model has been trained on Nvidia's [Nemotron Content Safety Dataset V1](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0), which combines:
 
 * Human prompts from the [Anthropic RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf) harmless dataset
 * LLM responses generated from [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
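A minimal sketch of pulling that dataset from the Hub with the `datasets` library; split and column names are not verified here, so consult the dataset card.

```python
# Sketch: load the content safety dataset referenced above from the Hub.
# Split/field names are not verified here; consult the dataset card.
from datasets import load_dataset

ds = load_dataset("nvidia/Aegis-AI-Content-Safety-Dataset-1.0")
print(ds)  # shows the available splits and columns
```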
@@ -407,7 +407,7 @@ We use the [PEFT](https://huggingface.co/docs/peft/en/index) library from Huggin
 
 The model has been evaluated on the following benchmarks:
 
-* Test partition of Nvidia's content safety dataset [Aegis Content Safety Dataset](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0)
+* The test partition of Nvidia's [Nemotron Content Safety Dataset V1](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0)
 * [Toxic Chat Dataset](https://huggingface.co/datasets/lmsys/toxic-chat)
 * [Open AI Moderation Dataset](https://huggingface.co/datasets/mmathys/openai-moderation-api-evaluation/tree/main)
 * [SimpleSafetyTests Benchmark](https://arxiv.org/html/2311.08370v2)
@@ -415,7 +415,7 @@ The model has been evaluated on the following benchmarks:
 #### Metrics
 We report F1 and AUPRC scores for the model on the evaluation benchmarks; a sketch of how these metrics can be computed follows this hunk.
 
-### Results on Aegis Content Safety Test Set
+### Results on Nemotron Content Safety V1 Test Set
 Model | AUPRC | F1 |
 ------------ |:-----------: |-----------: |
 Llama Guard Base | 0.930 | 0.62 |
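A minimal sketch of computing the two reported metrics with scikit-learn, assuming binary gold labels (1 = unsafe) and a per-example unsafe probability; the values below are placeholders, not the reported results.

```python
# Sketch: F1 and AUPRC as reported above, via scikit-learn.
# Labels and probabilities below are illustrative placeholders.
from sklearn.metrics import average_precision_score, f1_score

y_true = [1, 0, 1, 1, 0]                  # gold labels: 1 = unsafe, 0 = safe
y_prob = [0.92, 0.15, 0.71, 0.38, 0.05]   # model's P(unsafe) per example

auprc = average_precision_score(y_true, y_prob)         # area under the PR curve
f1 = f1_score(y_true, [int(p >= 0.5) for p in y_prob])  # F1 at a 0.5 threshold
print(f"AUPRC={auprc:.3f}  F1={f1:.3f}")
```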
 