# Model Card — Text Safety Analyzer

## Model Overview

Text Safety Analyzer is a multi-model ensemble designed to detect potentially harmful, unsafe, or obfuscated text. It classifies and explains risks across multiple dimensions — including harm to the author, reader, or target — and identifies attempts to bypass AI filters or conceal malicious intent through obfuscation or ASCII-based payloads.

The system integrates several specialized models from the Hugging Face Hub, along with lightweight heuristic modules for text normalization, entropy analysis, and obfuscation detection.
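For intuition, here is a minimal sketch of what the normalization and entropy heuristics could look like. The function names and logic are illustrative assumptions, not the repository's actual code:

```python
import math
import unicodedata
from collections import Counter

def normalize(text: str) -> str:
    # Fold confusable Unicode forms (e.g. fullwidth or stylized letters)
    # into a canonical lowercase representation before classification.
    return unicodedata.normalize("NFKC", text).casefold()

def shannon_entropy(text: str) -> float:
    # Bits per character. Very high values suggest encoded payloads
    # (base64, hex); very low values suggest repetitive ASCII art.
    if not text:
        return 0.0
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in Counter(text).values())

print(shannon_entropy("aaaaaaaa"))          # 0.0: pure repetition
print(shannon_entropy("dGVzdCBwYXlsb2Fk"))  # ~3.9: high, base64-like
```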
---

## Intended Uses

This system is intended for research, moderation, and content filtering purposes. It provides transparent, explainable detection of potentially unsafe text, but should always be complemented by human review.

It is not a censorship tool. It should not be used to make automated decisions without human oversight.

### Example Applications

- AI moderation pipelines
- Research on toxicity or social harm detection
- Pre-filtering user-generated content for platforms
- Analyzing obfuscation and prompt-injection attempts in AI contexts

---
## Model Architecture

### Multi-Model Pipeline

| Step | Description | Model | Source |
| --- | --- | --- | --- |
| Harm Detection | Detects harassment, hate, threats, self-harm, or general toxicity | unitary/toxic-bert | huggingface.co/unitary/toxic-bert |
| Self-Harm Focus | Detects self-harm and suicidal ideation | citiusLTL/SHRoBERTa | huggingface.co/citiusLTL/SHRoBERTa |
| Malicious URLs | Detects obfuscated or phishing URLs and domains | r3ddkahili/final-complete-malicious-url-model | huggingface.co/r3ddkahili/final-complete-malicious-url-model |
| Jailbreak/Bypass Detection | Identifies text attempting to subvert AI filters or security | Heuristic + optional fine-tuned model | Local / Custom |
| ASCII/Entropy Detection | Flags suspicious ASCII-art payloads or encoded data | Heuristic-based | Local / Custom |
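As a rough sketch, the three Hub components in the table could be loaded with the standard `transformers` text-classification pipeline. How this repository actually wires them together is an assumption here, not a confirmed excerpt:

```python
from transformers import pipeline

# Load the three Hub classifiers from the table above.
# top_k=None returns scores for every label of the multi-label model.
harm = pipeline("text-classification", model="unitary/toxic-bert", top_k=None)
self_harm = pipeline("text-classification", model="citiusLTL/SHRoBERTa")
url_check = pipeline(
    "text-classification",
    model="r3ddkahili/final-complete-malicious-url-model",
)

text = "You are worthless and everyone knows it."
print(harm(text))       # per-label toxicity scores
print(self_harm(text))  # self-harm vs. neutral classification
```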
---

## Input / Output Format

Input: Raw text (UTF-8)

Output (JSON):

```json
{
  "raw": "original text...",
  "normalized": "normalized text...",
  "entropy": 3.47,
  "flags": [
    {
      "type": "harm_classification",
      "model": "unitary/toxic-bert",
      "score": 0.91,
      "explain": "High probability of toxicity or harm"
    },
    {
      "type": "hidden_link_model",
      "model": "r3ddkahili/final-complete-malicious-url-model",
      "score": 0.84,
      "explain": "Detected obfuscated or malicious link"
    }
  ]
}
```
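A minimal sketch of consuming this output downstream; the 0.8 cutoff is an illustrative assumption, not an official default:

```python
import json

# `analyzer_output` stands in for the JSON document shown above.
analyzer_output = """{"raw": "...", "normalized": "...", "entropy": 3.47,
 "flags": [{"type": "harm_classification", "model": "unitary/toxic-bert",
            "score": 0.91, "explain": "High probability of toxicity or harm"}]}"""

THRESHOLD = 0.8  # illustrative cutoff for routing to human review
report = json.loads(analyzer_output)

for flag in report["flags"]:
    if flag["score"] >= THRESHOLD:
        print(f"{flag['type']} ({flag['model']}): {flag['explain']}")
```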
---

## Evaluation Metrics

Each component model retains its original performance benchmarks:

| Model | Task | F1 | Precision | Recall |
| --- | --- | --- | --- | --- |
| unitary/toxic-bert | Toxic comment classification | ~0.92 | ~0.91 | ~0.93 |
| citiusLTL/SHRoBERTa | Self-harm classification | ~0.88 | ~0.87 | ~0.90 |
| r3ddkahili/final-complete-malicious-url-model | Malicious URL detection | ~0.96 | ~0.96 | ~0.96 |

The ensemble's performance depends on thresholding and heuristic tuning, so users should benchmark on their own data for optimal precision-recall balance.
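One way to run that benchmark is a precision-recall sweep over your own labeled data; the arrays below are placeholder values, not results from this project:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# y_true: labels from your own moderation data (1 = unsafe);
# y_score: the ensemble's flag scores for the same texts.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.91, 0.30, 0.84, 0.67, 0.55, 0.12, 0.95, 0.40])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-9, None)
best = np.argmax(f1[:-1])  # the final PR point has no threshold attached
print(f"best threshold = {thresholds[best]:.2f}, F1 = {f1[best]:.2f}")
```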
---

## Limitations & Ethical Considerations

- Models may misclassify benign creative text as harmful (false positives).
- Cultural and linguistic bias may affect harm classification outcomes.
- Obfuscation detection relies on heuristics that can produce noise.
- The system does not identify context-specific harm (e.g. satire, irony).

Use responsibly: never apply outputs directly to penalize users or restrict access without manual review.
---

## Licensing & Attribution

This project uses open models under their respective licenses:

- unitary/toxic-bert: Apache 2.0
- citiusLTL/SHRoBERTa: MIT
- r3ddkahili/final-complete-malicious-url-model: Apache 2.0

This repository's code is released under the MIT License.
---

## Future Work

- Add a fine-tuned model for jailbreak prompt detection
- Add a multilingual ASCII/obfuscation classifier
- Integrate model caching and batch inference for performance
- Provide datasets and evaluation scripts for reproducibility

---

Contact: [email protected]
Repository: PatoFlamejanteTV/Safe-O-Bot