Commit 8c9689c (verified) by sfermion · Parent: 1fe1d66

Upload folder using huggingface_hub

Files changed (2):
  1. .DS_Store (+0 -0)
  2. README.md (+337 -0)

.DS_Store ADDED
Binary file (6.15 kB)

README.md ADDED
---
language: en
license: apache-2.0
tags:
- bert
- token-classification
- ner
- pii
- privacy
- onnx
- personal-information
datasets:
- ai4privacy/pii-masking-300k
metrics:
- f1
- precision
- recall
model-index:
- name: bert-pii-onnx
  results: []
---

# BERT PII Detection Model (ONNX)

This model is a BERT-based token classification model fine-tuned to detect Personally Identifiable Information (PII) in text. It is provided in ONNX format for efficient inference across different platforms.

## Model Description

- **Model Type:** Token Classification (Named Entity Recognition)
- **Base Model:** `bert-base-uncased` (Google BERT)
- **Format:** ONNX
- **Language:** English
- **License:** Apache 2.0
- **Training Dataset:** ai4privacy/pii-masking-300k

## Intended Use

This model is designed to identify and classify various types of personally identifiable information in text, including but not limited to:

### Supported PII Categories

The model can detect 27 different types of PII entities:

#### Personal Identifiers
- **GIVENNAME1, GIVENNAME2** - First/given names
- **LASTNAME1, LASTNAME2, LASTNAME3** - Last/family names
- **USERNAME** - Usernames
- **TITLE** - Personal titles
- **SEX** - Gender information

#### Contact Information
- **EMAIL** - Email addresses
- **TEL** - Telephone numbers
- **IP** - IP addresses

#### Location Information
- **STREET** - Street addresses
- **CITY** - City names
- **STATE** - State/province names
- **COUNTRY** - Country names
- **POSTCODE** - Postal/ZIP codes
- **BUILDING** - Building names/numbers
- **SECADDRESS** - Secondary addresses
- **GEOCOORD** - Geographic coordinates

#### Identification Documents
- **PASSPORT** - Passport numbers
- **IDCARD** - ID card numbers
- **DRIVERLICENSE** - Driver's license numbers
- **SOCIALNUMBER** - Social security numbers
- **PASS** - Password information

#### Temporal Information
- **DATE** - Date information
- **TIME** - Time information
- **BOD** - Birth date

The model uses the BIO (Begin-Inside-Outside) tagging scheme, where:
- `B-[ENTITY]` marks the beginning of an entity
- `I-[ENTITY]` marks the continuation of an entity
- `O` marks tokens that are not PII
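
For example (an illustrative tagging, not drawn from the training data), the sentence "My name is John Smith" would be labelled token by token as:

```
my      O
name    O
is      O
john    B-GIVENNAME1
smith   B-LASTNAME1
```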

## Usage

### Requirements

```bash
pip install onnxruntime transformers tokenizers
```

### Python Example

```python
import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("path/to/model")

# Load ONNX model
session = ort.InferenceSession("onnx/model.onnx")

# Prepare input text
text = "My name is John Smith and my email is john.smith@example.com"
inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True)

# Run inference
outputs = session.run(
    None,
    {
        "input_ids": inputs["input_ids"].astype(np.int64),
        "attention_mask": inputs["attention_mask"].astype(np.int64),
        "token_type_ids": inputs["token_type_ids"].astype(np.int64),
    },
)

# Get predictions
logits = outputs[0]
predictions = np.argmax(logits, axis=-1)

# Map predictions to labels
id2label = {
    0: "B-BOD", 1: "B-BUILDING", 2: "B-CITY", 3: "B-COUNTRY",
    4: "B-DATE", 5: "B-DRIVERLICENSE", 6: "B-EMAIL", 7: "B-GEOCOORD",
    8: "B-GIVENNAME1", 9: "B-GIVENNAME2", 10: "B-IDCARD", 11: "B-IP",
    12: "B-LASTNAME1", 13: "B-LASTNAME2", 14: "B-LASTNAME3", 15: "B-PASS",
    16: "B-PASSPORT", 17: "B-POSTCODE", 18: "B-SECADDRESS", 19: "B-SEX",
    20: "B-SOCIALNUMBER", 21: "B-STATE", 22: "B-STREET", 23: "B-TEL",
    24: "B-TIME", 25: "B-TITLE", 26: "B-USERNAME", 27: "I-BOD",
    28: "I-BUILDING", 29: "I-CITY", 30: "I-COUNTRY", 31: "I-DATE",
    32: "I-DRIVERLICENSE", 33: "I-EMAIL", 34: "I-GEOCOORD", 35: "I-GIVENNAME1",
    36: "I-GIVENNAME2", 37: "I-IDCARD", 38: "I-IP", 39: "I-LASTNAME1",
    40: "I-LASTNAME2", 41: "I-LASTNAME3", 42: "I-PASS", 43: "I-PASSPORT",
    44: "I-POSTCODE", 45: "I-SECADDRESS", 46: "I-SEX", 47: "I-SOCIALNUMBER",
    48: "I-STATE", 49: "I-STREET", 50: "I-TEL", 51: "I-TIME",
    52: "I-TITLE", 53: "I-USERNAME", 54: "O"
}

# Decode predictions
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [id2label[int(pred)] for pred in predictions[0]]

for token, label in zip(tokens, labels):
    if token not in ["[CLS]", "[SEP]", "[PAD]"]:
        print(f"{token}: {label}")
```
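
The per-token output can be folded back into entity spans for downstream use. The helper below is a minimal sketch (not part of the original example) that merges consecutive B-/I- tags and re-joins WordPiece sub-tokens; it reuses the `tokens` and `labels` variables from the snippet above.

```python
def group_entities(tokens, labels):
    """Merge BIO-tagged WordPiece tokens into (entity_type, text) spans."""
    entities, current_type, current_tokens = [], None, []
    for token, label in zip(tokens, labels):
        if token in ("[CLS]", "[SEP]", "[PAD]"):
            continue
        if label.startswith("B-"):
            if current_type:
                entities.append((current_type, current_tokens))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        elif token.startswith("##") and current_type is not None:
            # Keep WordPiece continuations attached to the open entity
            current_tokens.append(token)
        else:
            if current_type:
                entities.append((current_type, current_tokens))
            current_type, current_tokens = None, []
    if current_type:
        entities.append((current_type, current_tokens))
    # Re-join sub-tokens into readable strings
    return [
        (etype, "".join(t[2:] if t.startswith("##") else " " + t for t in toks).strip())
        for etype, toks in entities
    ]

print(group_entities(tokens, labels))  # list of (entity_type, text) pairs
```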

### JavaScript/Node.js Example

The sketch below assumes the tokenizer is loaded with Transformers.js (`@xenova/transformers`) while the ONNX graph is run with `onnxruntime-node`; because the two libraries use different tensor types, the tokenizer output is converted to `ort.Tensor` objects explicitly.

```javascript
import * as ort from 'onnxruntime-node';
import { AutoTokenizer } from '@xenova/transformers';

async function detectPII(text) {
  // Load tokenizer
  const tokenizer = await AutoTokenizer.from_pretrained('path/to/model');

  // Load ONNX model
  const session = await ort.InferenceSession.create('onnx/model.onnx');

  // Tokenize input
  const encoded = await tokenizer(text, { padding: true, truncation: true });
  const dims = encoded.input_ids.dims;

  // Build the feeds expected by the BERT graph; token_type_ids is all zeros
  // for a single sequence
  const feeds = {
    input_ids: new ort.Tensor('int64', encoded.input_ids.data, dims),
    attention_mask: new ort.Tensor('int64', encoded.attention_mask.data, dims),
    token_type_ids: new ort.Tensor('int64', new BigInt64Array(dims[0] * dims[1]), dims),
  };

  // Run inference
  const outputs = await session.run(feeds);

  // Process outputs: logits has shape [batch, sequence_length, num_labels]
  const logits = outputs.logits;
  // ... argmax over the last dimension, then map label ids to names
}
```

## Model Architecture

- **Architecture:** BertForTokenClassification
- **Hidden Size:** 768
- **Intermediate Size:** 3072
- **Attention Heads:** 12 (typical for BERT-base)
- **Hidden Layers:** 12 (typical for BERT-base)
- **Activation Function:** GELU
- **Max Sequence Length:** 512 tokens (architectural maximum; fine-tuning used 128 tokens, see Training Procedure)
- **Dropout:** 0.1
- **Number of Labels:** 55 (54 PII labels + Outside)
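
As a quick sanity check (a sketch, not a file shipped in this repository), onnxruntime can list the graph's input and output signatures, which should match the configuration above:

```python
import onnxruntime as ort

session = ort.InferenceSession("onnx/model.onnx")

# Expected: input_ids, attention_mask, token_type_ids with shape [batch, sequence]
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)

# Expected: logits with shape [batch, sequence, 55]
for out in session.get_outputs():
    print(out.name, out.shape, out.type)
```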

## Training Details

### Training Data

The model was fine-tuned on the **ai4privacy/pii-masking-300k** dataset:
- **Dataset:** [ai4privacy/pii-masking-300k](https://huggingface.co/datasets/ai4privacy/pii-masking-300k)
- **Size:** 300,000 examples
- **Format:** Pre-annotated text with BIO labels for PII entities
- **License:** Check the dataset page for license details

### Training Procedure

- **Base Model:** `bert-base-uncased` (Google BERT)
- **Tokenization:** WordPiece tokenization with lowercase normalization
- **Max Sequence Length:** 128 tokens (optimized for efficiency)
- **Padding Token:** [PAD] (ID: 0)
- **Unknown Token:** [UNK] (ID: 100)
- **CLS Token:** [CLS] (ID: 101)
- **SEP Token:** [SEP] (ID: 102)
- **Mask Token:** [MASK] (ID: 103)
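
These IDs are the standard `bert-base-uncased` values; as a quick check (a sketch assuming the tokenizer files from this repository), they can be read back from the tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/model")

# Should print 0 100 101 102 103 for [PAD], [UNK], [CLS], [SEP], [MASK]
print(tokenizer.pad_token_id, tokenizer.unk_token_id, tokenizer.cls_token_id,
      tokenizer.sep_token_id, tokenizer.mask_token_id)
```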

### Training Hyperparameters

- **Learning Rate:** 2e-5
- **Batch Size:** 16 (per device)
- **Number of Epochs:** 3
- **Weight Decay:** 0.01
- **Optimizer:** AdamW (default)
- **Training Platform:** Kaggle with GPU T4 x2
- **Training Time:** ~1-2 hours
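
As a rough sketch, these settings map onto `transformers.TrainingArguments` as follows (assuming the standard `Trainer` API; the actual training script is not included in this repository, and `output_dir` is a hypothetical path):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-pii-checkpoints",  # hypothetical output path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,                  # AdamW is the Trainer default optimizer
    evaluation_strategy="epoch",        # matches the per-epoch evaluation below
)
```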

### Evaluation Strategy

- **Evaluation Metric:** SeqEval (standard for NER tasks)
- **Evaluation Strategy:** Every epoch
- **Metrics Tracked:**
  - Precision
  - Recall
  - F1 Score
  - Accuracy
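
A minimal sketch of a seqeval-based `compute_metrics` function for this setup (assuming the common convention of labelling special tokens and sub-token continuations with -100, and reusing the `id2label` mapping from the Python example above):

```python
import numpy as np
from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score

def compute_metrics(eval_pred):
    logits, label_ids = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # Keep only positions with a real label (ignore -100)
    references, hypotheses = [], []
    for pred_row, label_row in zip(predictions, label_ids):
        references.append([id2label[int(l)] for l, p in zip(label_row, pred_row) if l != -100])
        hypotheses.append([id2label[int(p)] for l, p in zip(label_row, pred_row) if l != -100])

    return {
        "precision": precision_score(references, hypotheses),
        "recall": recall_score(references, hypotheses),
        "f1": f1_score(references, hypotheses),
        "accuracy": accuracy_score(references, hypotheses),
    }
```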

## Evaluation

The model should be evaluated on appropriate PII detection benchmarks using standard NER metrics (F1, Precision, Recall) for each entity type.

## Limitations and Bias

- The model's performance may vary across different text domains and writing styles
- May not generalize well to PII formats from countries/regions not well-represented in training data
- Context-dependent entities (e.g., names that are also common words) may be challenging
- The model may have biases present in the training data
- Should not be used as the sole method for PII detection in critical applications without human review

## Ethical Considerations

This model is designed to help protect privacy by detecting PII in text. However:

- The model is not perfect and may miss some PII (false negatives) or incorrectly flag non-PII (false positives)
- Should be used as part of a comprehensive privacy protection strategy
- Users should be aware of applicable privacy regulations (GDPR, CCPA, etc.)
- The model's use should comply with all relevant laws and regulations
- Consider the implications of automated PII detection in your specific use case

## ONNX Runtime Compatibility

This model is compatible with ONNX Runtime and can be deployed on:
- CPU (optimized for inference)
- GPU (CUDA)
- Edge devices
- Web browsers (via ONNX Runtime Web)
- Mobile devices (iOS/Android)
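
For example, on a CUDA-capable machine the session can be created with an explicit execution-provider list that prefers the GPU and otherwise uses the CPU (a generic onnxruntime pattern, not specific to this model):

```python
import onnxruntime as ort

session = ort.InferenceSession(
    "onnx/model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # shows which providers were actually enabled
```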

## File Structure

```
.
├── README.md                  # This file
├── config.json                # Model configuration
├── tokenizer_config.json      # Tokenizer configuration
├── tokenizer.json             # Fast tokenizer
├── vocab.txt                  # Vocabulary file
├── special_tokens_map.json    # Special tokens mapping
└── onnx/
    └── model.onnx             # ONNX model file
```

## Citation

If you use this model in your research or application, please cite:

```bibtex
@misc{bert-pii-onnx,
  title={BERT PII Detection Model (ONNX)},
  author={Your Name/Organization},
  year={2025},
  howpublished={\url{https://huggingface.co/your-username/bert-pii-onnx}}
}
```

### Base Model Citation

This model is based on BERT. Please also cite the original BERT paper:

```bibtex
@article{devlin2018bert,
  title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding},
  author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1810.04805},
  year={2018}
}
```

## Contact

For questions, issues, or feedback about this model, please open an issue in the model repository.

## Acknowledgments

### Base Model

This model is built upon **BERT (Bidirectional Encoder Representations from Transformers)** developed by Google Research:
- Original BERT paper: [Devlin et al., 2018](https://arxiv.org/abs/1810.04805)
- BERT is licensed under Apache 2.0

### Dataset

The model was trained on **ai4privacy/pii-masking-300k**:
- Dataset: [ai4privacy/pii-masking-300k](https://huggingface.co/datasets/ai4privacy/pii-masking-300k)
- Creator: ai4privacy team on Hugging Face
- Size: 300,000 examples with PII annotations
- Please cite the dataset creators if you use this model

```bibtex
@misc{ai4privacy-pii-dataset,
  title={PII Masking 300K Dataset},
  author={ai4privacy},
  year={2024},
  howpublished={\url{https://huggingface.co/datasets/ai4privacy/pii-masking-300k}}
}
```

### Technologies

- **Transformers Library**: [Hugging Face](https://github.com/huggingface/transformers)
- **ONNX**: [Open Neural Network Exchange](https://onnx.ai/) for cross-platform model deployment
- **ONNX Runtime**: [Microsoft ONNX Runtime](https://onnxruntime.ai/) for efficient inference

### Special Thanks

- Hugging Face team for the Transformers library and model hub infrastructure
- ONNX community for standardized model format and runtime
- Contributors to the training dataset (if applicable)