Commit 76a0618 (verified) · committed by prasoonv · parent: c7be539

Update model name

Signed-off-by: Prasoon Varshney <[email protected]>

Files changed (1): README.md (+9 -9)
README.md CHANGED
@@ -17,19 +17,19 @@ pipeline_tag: text-classification
 The use of this model is governed by the [Llama 2 Community License Agreement](https://ai.meta.com/llama/license/).
 
 ## Model Details
-Aegis-AI-Content-Safety-LlamaGuard-LLM-Permissive-1.0 is a LLM content safety model. It is a parameter efficient instruction tuned version of [Llama Guard](https://huggingface.co/meta-llama/LlamaGuard-7b) based on [Llama2-7B](https://arxiv.org/abs/2307.09288) trained on Nvidia's content safety dataset [Aegis Content Safety Dataset](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0) covering Nvidia's broad taxonomy of 13 critical safety risk categories.
+Llama Nemotron Safety Guard Permissive V1, formerly known as Aegis-AI-Content-Safety-LlamaGuard-LLM-Permissive-1.0, is an LLM content safety model. It is a parameter-efficient, instruction-tuned version of [Llama Guard](https://huggingface.co/meta-llama/LlamaGuard-7b), based on [Llama2-7B](https://arxiv.org/abs/2307.09288), trained on Nvidia's [Aegis Content Safety Dataset](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0), which covers Nvidia's broad taxonomy of 13 critical safety risk categories.
 
-Paper Details: [Aegis Content Moderation](https://arxiv.org/pdf/2404.05993.pdf#page=10.63)
+Paper Details: [Aegis 1.0: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts](https://arxiv.org/pdf/2404.05993.pdf#page=10.63)
 
 ### Model Description
-The Aegis-AI-Content-Safety-LlamaGuard-LLM-Permissive-1.0 model involves the following stages:
+The Llama-2-Nemotron-Safety-Guard-Permissive-7B-v1 model involves the following stages:
 
 1. A system instruction provides the safety taxonomy and a safety policy with inclusions and exclusions.
 2. The system prompt instructs the LLM to moderate a user prompt, a partial dialog, or a full dialog.
 3. The LLM response is a string that is either "safe" or "unsafe". If the generated string is "unsafe", the LLM outputs the category ID of the violation on a new line, based on the policy in the system prompt (see the inference sketch after this hunk).
 4. Novel safety risk categories and policies can be provided in the instruction, and the model will categorize using the novel taxonomy and policy.
 5. The safety taxonomy and policy used to train the models contain 13 critically unsafe risk categories, a safe category, and a "needs caution" category.
-6. Internally annotated dataset called Aegis-AI-Content-Safety-Dataset-1.0 of approximately 11,000 prompts and responses are used to instruction tune the model. Annotations are at dialog level not per turn.
+6. An internally annotated dataset called [Nemotron Content Safety Dataset V1](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0), comprising approximately 11,000 prompts and responses, is used to instruction-tune the model. Annotations are at the dialog level, not per turn.
 We have since collected 30,000 annotations in total on a further expanded taxonomy, and future versions of the models will be trained on the full set.
 7. The model is instruction-tuned with the safety instruction, with the LLM behaving as a classifier in this setting.
 PLEASE NOTE: The model has only been trained to perform prompt classification, since annotations were not available at the turn level. If you wish to use the model for response classification, use the template provided below.
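A minimal inference sketch of the flow in steps 1-7, assuming the checkpoint is a PEFT adapter applied to the LlamaGuard-7b base; the adapter repository ID, generation settings, and output parsing are illustrative assumptions, and the system prompt carrying the full taxonomy is elided here.

```python
# Minimal sketch, not the official inference path: load a PEFT adapter
# onto the Llama Guard base and parse the "safe" / "unsafe\nO1"-style
# output format described in step 3. Repo IDs below are assumptions.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "meta-llama/LlamaGuard-7b"  # base model per the model card
ADAPTER_ID = "nvidia/Aegis-AI-Content-Safety-LlamaGuard-Permissive-1.0"  # assumed adapter repo

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
model = AutoModelForCausalLM.from_pretrained(
    BASE_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, ADAPTER_ID)

def moderate(full_prompt: str) -> tuple[str, list[str]]:
    """Run the safety classifier; return ("safe"|"unsafe", violated category IDs)."""
    inputs = tokenizer(full_prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=32,
                            pad_token_id=tokenizer.eos_token_id)
    # Decode only the newly generated tokens.
    text = tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:],
                            skip_special_tokens=True).strip()
    lines = text.splitlines()
    verdict = lines[0].strip()
    # On "unsafe", the violated category IDs (e.g. "O1") follow on the next line.
    categories = (
        [c.strip() for c in lines[1].split(",")]
        if verdict == "unsafe" and len(lines) > 1 else []
    )
    return verdict, categories
```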
@@ -155,7 +155,7 @@ O6
 ```
 
 
-The difference between this Llama Guard Permissive and the [Llama Guard Defensive](https://huggingface.co/nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0) model is that, the permissive model is trained on the Aegis data where ***Needs Caution*** category is mapped to ***Safe*** whereas the for the defensive model, the ***Needs Caution*** category is mapped to ***Unsafe***.
+The difference between this Llama Nemotron Safety Guard Permissive model and the [Llama Nemotron Safety Guard Defensive](https://huggingface.co/nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0) model is that the permissive model is trained on the Aegis data with the ***Needs Caution*** category mapped to ***Safe***, whereas for the defensive model the ***Needs Caution*** category is mapped to ***Unsafe*** (a sketch of this mapping follows this hunk).
 
 - **Developed by:** Shaona Ghosh, Nvidia
 - **Model type:** Instruction-tuned Llama2-7B
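A minimal sketch of that ***Needs Caution*** mapping; the literal label strings are assumptions based on the category names, not verified dataset field values.

```python
# Sketch of the permissive vs. defensive label mapping described above.
# Label strings are assumptions; check the dataset card for exact values.
def map_needs_caution(label: str, defensive: bool = False) -> str:
    """Collapse the ternary annotation into the binary training label."""
    if label == "Needs Caution":
        return "Unsafe" if defensive else "Safe"  # permissive maps it to Safe
    return label

assert map_needs_caution("Needs Caution") == "Safe"
assert map_needs_caution("Needs Caution", defensive=True) == "Unsafe"
```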
@@ -169,7 +169,7 @@ Ethical use: Technology can have a profound impact on people and the world, and
 
 ### Direct Use
 
-- The Aegis-AI-Content-Safety-LlamaGuard-LLM-Permissive-1.0 model is for users who wants to safeguard or evaluate a general purpose LLM's generated content
+- The Llama-2-Nemotron-Safety-Guard-Permissive-7B-v1 model is for users who want to safeguard or evaluate a general-purpose LLM's generated content.
 
 Model and dataset restrictions:
 
@@ -371,7 +371,7 @@ The inference code for this model is available through the NeMo Curator GitHub r
 ## Training Details
 
 ### Training Data
-The model has been trained on Nvidia's [Aegis Content Safety Dataset](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0)
+The model has been trained on Nvidia's [Nemotron Content Safety Dataset V1](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0), which combines:
 
 * Human prompts from the [Anthropic RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf) harmless dataset
 * LLM responses generated from [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
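A minimal sketch of pulling that dataset from the Hub with the `datasets` library; split and column names are not verified here, so consult the dataset card.

```python
# Sketch: load the content safety dataset referenced above from the Hub.
# Split/field names are not verified here; consult the dataset card.
from datasets import load_dataset

ds = load_dataset("nvidia/Aegis-AI-Content-Safety-Dataset-1.0")
print(ds)  # shows the available splits and columns
```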
@@ -407,7 +407,7 @@ We use the [PEFT](https://huggingface.co/docs/peft/en/index) library from Huggin
 
 The model has been evaluated on the following benchmarks:
 
-* Test partition of Nvidia's content safety dataset [Aegis Content Safety Dataset](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0)
+* The test partition of Nvidia's [Nemotron Content Safety Dataset V1](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0)
 * [Toxic Chat Dataset](https://huggingface.co/datasets/lmsys/toxic-chat)
 * [Open AI Moderation Dataset](https://huggingface.co/datasets/mmathys/openai-moderation-api-evaluation/tree/main)
 * [SimpleSafetyTests Benchmark](https://arxiv.org/html/2311.08370v2)
@@ -415,7 +415,7 @@ The model has been evaluated on the following benchmarks:
 #### Metrics
 We report F1 and AUPRC scores for the model on the evaluation benchmarks; a sketch of how these metrics can be computed follows this hunk.
 
-### Results on Aegis Content Safety Test Set
+### Results on Nemotron Content Safety V1 Test Set
 Model | AUPRC | F1 |
 ------------ |:-----------: |-----------: |
 Llama Guard Base | 0.930 | 0.62 |
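A minimal sketch of computing the two reported metrics with scikit-learn, assuming binary gold labels (1 = unsafe) and a per-example unsafe probability; the values below are placeholders, not the reported results.

```python
# Sketch: F1 and AUPRC as reported above, via scikit-learn.
# Labels and probabilities below are illustrative placeholders.
from sklearn.metrics import average_precision_score, f1_score

y_true = [1, 0, 1, 1, 0]                  # gold labels: 1 = unsafe, 0 = safe
y_prob = [0.92, 0.15, 0.71, 0.38, 0.05]   # model's P(unsafe) per example

auprc = average_precision_score(y_true, y_prob)         # area under the PR curve
f1 = f1_score(y_true, [int(p >= 0.5) for p in y_prob])  # F1 at a 0.5 threshold
print(f"AUPRC={auprc:.3f}  F1={f1:.3f}")
```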
 