# Model Card — Text Safety Analyzer

## Model Overview

Text Safety Analyzer is a multi-model ensemble designed to detect potentially harmful, unsafe, or obfuscated text. It classifies and explains risks across multiple dimensions — including harm to the author, reader, or target — and identifies attempts to bypass AI filters or conceal malicious intent through obfuscation or ASCII-based payloads.

The system integrates several specialized models from the Hugging Face Hub, along with lightweight heuristic modules for text normalization, entropy analysis, and obfuscation detection.
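For intuition, here is a minimal sketch of what the normalization and entropy heuristics could look like. The function names and logic are illustrative assumptions, not the repository's actual code:

```python
import math
import unicodedata
from collections import Counter

def normalize(text: str) -> str:
    # Fold confusable Unicode forms (e.g. fullwidth or stylized letters)
    # into a canonical lowercase representation before classification.
    return unicodedata.normalize("NFKC", text).casefold()

def shannon_entropy(text: str) -> float:
    # Bits per character. Very high values suggest encoded payloads
    # (base64, hex); very low values suggest repetitive ASCII art.
    if not text:
        return 0.0
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in Counter(text).values())

print(shannon_entropy("aaaaaaaa"))          # 0.0: pure repetition
print(shannon_entropy("dGVzdCBwYXlsb2Fk"))  # ~3.9: high, base64-like
```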
---

## Intended Uses

This system is intended for research, moderation, and content filtering purposes. It provides transparent, explainable detection of potentially unsafe text, but should always be complemented by human review.

It is not a censorship tool. It should not be used to make automated decisions without human oversight.

### Example Applications

- AI moderation pipelines
- Research on toxicity or social harm detection
- Pre-filtering user-generated content for platforms
- Analyzing obfuscation and prompt-injection attempts in AI contexts

---
## Model Architecture

### Multi-Model Pipeline

| Step | Description | Model | Source |
| --- | --- | --- | --- |
| Harm Detection | Detects harassment, hate, threats, self-harm, or general toxicity | unitary/toxic-bert | huggingface.co/unitary/toxic-bert |
| Self-Harm Focus | Detects self-harm and suicidal ideation | citiusLTL/SHRoBERTa | huggingface.co/citiusLTL/SHRoBERTa |
| Malicious URLs | Detects obfuscated or phishing URLs and domains | r3ddkahili/final-complete-malicious-url-model | huggingface.co/r3ddkahili/final-complete-malicious-url-model |
| Jailbreak/Bypass Detection | Identifies text attempting to subvert AI filters or security | Heuristic + optional fine-tuned model | Local / Custom |
| ASCII/Entropy Detection | Flags suspicious ASCII-art payloads or encoded data | Heuristic-based | Local / Custom |
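As a rough sketch, the three Hub components in the table could be loaded with the standard `transformers` text-classification pipeline. How this repository actually wires them together is an assumption here, not a confirmed excerpt:

```python
from transformers import pipeline

# Load the three Hub classifiers from the table above.
# top_k=None returns scores for every label of the multi-label model.
harm = pipeline("text-classification", model="unitary/toxic-bert", top_k=None)
self_harm = pipeline("text-classification", model="citiusLTL/SHRoBERTa")
url_check = pipeline(
    "text-classification",
    model="r3ddkahili/final-complete-malicious-url-model",
)

text = "You are worthless and everyone knows it."
print(harm(text))       # per-label toxicity scores
print(self_harm(text))  # self-harm vs. neutral classification
```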
---

## Input / Output Format

Input: Raw text (UTF-8)

Output (JSON):

```json
{
  "raw": "original text...",
  "normalized": "normalized text...",
  "entropy": 3.47,
  "flags": [
    {
      "type": "harm_classification",
      "model": "unitary/toxic-bert",
      "score": 0.91,
      "explain": "High probability of toxicity or harm"
    },
    {
      "type": "hidden_link_model",
      "model": "r3ddkahili/final-complete-malicious-url-model",
      "score": 0.84,
      "explain": "Detected obfuscated or malicious link"
    }
  ]
}
```
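A minimal sketch of consuming this output downstream; the 0.8 cutoff is an illustrative assumption, not an official default:

```python
import json

# `analyzer_output` stands in for the JSON document shown above.
analyzer_output = """{"raw": "...", "normalized": "...", "entropy": 3.47,
 "flags": [{"type": "harm_classification", "model": "unitary/toxic-bert",
            "score": 0.91, "explain": "High probability of toxicity or harm"}]}"""

THRESHOLD = 0.8  # illustrative cutoff for routing to human review
report = json.loads(analyzer_output)

for flag in report["flags"]:
    if flag["score"] >= THRESHOLD:
        print(f"{flag['type']} ({flag['model']}): {flag['explain']}")
```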
---

## Evaluation Metrics

Each component model retains its original performance benchmarks:

| Model | Task | F1 | Precision | Recall |
| --- | --- | --- | --- | --- |
| unitary/toxic-bert | Toxic comment classification | ~0.92 | ~0.91 | ~0.93 |
| citiusLTL/SHRoBERTa | Self-harm classification | ~0.88 | ~0.87 | ~0.90 |
| r3ddkahili/final-complete-malicious-url-model | Malicious URL detection | ~0.96 | ~0.96 | ~0.96 |

The ensemble's performance depends on thresholding and heuristic tuning, so users should benchmark on their own data for optimal precision-recall balance.
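One way to run that benchmark is a precision-recall sweep over your own labeled data; the arrays below are placeholder values, not results from this project:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# y_true: labels from your own moderation data (1 = unsafe);
# y_score: the ensemble's flag scores for the same texts.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.91, 0.30, 0.84, 0.67, 0.55, 0.12, 0.95, 0.40])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-9, None)
best = np.argmax(f1[:-1])  # the final PR point has no threshold attached
print(f"best threshold = {thresholds[best]:.2f}, F1 = {f1[best]:.2f}")
```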
---

## Limitations & Ethical Considerations

- Models may misclassify benign creative text as harmful (false positives).
- Cultural and linguistic bias may affect harm classification outcomes.
- Obfuscation detection relies on heuristics that can produce noise.
- The system does not identify context-specific harm (e.g. satire, irony).

Use responsibly: never apply outputs directly to penalize users or restrict access without manual review.
---

## Licensing & Attribution

This project uses open models under their respective licenses:

- unitary/toxic-bert: Apache 2.0
- citiusLTL/SHRoBERTa: MIT
- r3ddkahili/final-complete-malicious-url-model: Apache 2.0

This repository's code is released under the MIT License.
---

## Future Work

- Add a fine-tuned model for jailbreak prompt detection
- Add a multilingual ASCII/obfuscation classifier
- Integrate model caching and batch inference for performance
- Provide datasets and evaluation scripts for reproducibility

---

Contact: [email protected]
Repository: PatoFlamejanteTV/Safe-O-Bot