--- language: en license: apache-2.0 tags: - bert - token-classification - ner - pii - privacy - onnx - personal-information datasets: - custom metrics: - f1 - precision - recall model-index: - name: bert-pii-onnx results: [] --- # BERT PII Detection Model (ONNX) This model is a BERT-based token classification model fine-tuned for detecting Personally Identifiable Information (PII) in text. The model is provided in ONNX format for efficient inference across different platforms. ## Model Description - **Model Type:** Token Classification (Named Entity Recognition) - **Base Model:** `bert-base-uncased` (Google BERT) - **Format:** ONNX - **Language:** English - **License:** Apache 2.0 - **Training Dataset:** ai4privacy/pii-masking-300k ## Intended Use This model is designed to identify and classify various types of personally identifiable information in text, including but not limited to: ### Supported PII Categories The model can detect 27 different types of PII entities: #### Personal Identifiers - **GIVENNAME1, GIVENNAME2** - First/given names - **LASTNAME1, LASTNAME2, LASTNAME3** - Last/family names - **USERNAME** - Usernames - **TITLE** - Personal titles - **SEX** - Gender information #### Contact Information - **EMAIL** - Email addresses - **TEL** - Telephone numbers - **IP** - IP addresses #### Location Information - **STREET** - Street addresses - **CITY** - City names - **STATE** - State/province names - **COUNTRY** - Country names - **POSTCODE** - Postal/ZIP codes - **BUILDING** - Building names/numbers - **SECADDRESS** - Secondary addresses - **GEOCOORD** - Geographic coordinates #### Identification Documents - **PASSPORT** - Passport numbers - **IDCARD** - ID card numbers - **DRIVERLICENSE** - Driver's license numbers - **SOCIALNUMBER** - Social security numbers - **PASS** - Password information #### Temporal Information - **DATE** - Date information - **TIME** - Time information - **BOD** - Birth date The model uses BIO (Begin-Inside-Outside) tagging scheme, where: - `B-[ENTITY]` marks the beginning of an entity - `I-[ENTITY]` marks the continuation of an entity - `O` marks tokens that are not PII ## Usage ### Requirements ```bash pip install onnxruntime transformers tokenizers ``` ### Python Example ```python import onnxruntime as ort from transformers import AutoTokenizer import numpy as np # Load tokenizer tokenizer = AutoTokenizer.from_pretrained("path/to/model") # Load ONNX model session = ort.InferenceSession("onnx/model.onnx") # Prepare input text text = "My name is John Smith and my email is john.smith@example.com" inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True) # Run inference outputs = session.run( None, { "input_ids": inputs["input_ids"].astype(np.int64), "attention_mask": inputs["attention_mask"].astype(np.int64), "token_type_ids": inputs["token_type_ids"].astype(np.int64) } ) # Get predictions logits = outputs[0] predictions = np.argmax(logits, axis=-1) # Map predictions to labels id2label = { 0: "B-BOD", 1: "B-BUILDING", 2: "B-CITY", 3: "B-COUNTRY", 4: "B-DATE", 5: "B-DRIVERLICENSE", 6: "B-EMAIL", 7: "B-GEOCOORD", 8: "B-GIVENNAME1", 9: "B-GIVENNAME2", 10: "B-IDCARD", 11: "B-IP", 12: "B-LASTNAME1", 13: "B-LASTNAME2", 14: "B-LASTNAME3", 15: "B-PASS", 16: "B-PASSPORT", 17: "B-POSTCODE", 18: "B-SECADDRESS", 19: "B-SEX", 20: "B-SOCIALNUMBER", 21: "B-STATE", 22: "B-STREET", 23: "B-TEL", 24: "B-TIME", 25: "B-TITLE", 26: "B-USERNAME", 27: "I-BOD", 28: "I-BUILDING", 29: "I-CITY", 30: "I-COUNTRY", 31: "I-DATE", 32: "I-DRIVERLICENSE", 33: "I-EMAIL", 34: "I-GEOCOORD", 35: "I-GIVENNAME1", 36: "I-GIVENNAME2", 37: "I-IDCARD", 38: "I-IP", 39: "I-LASTNAME1", 40: "I-LASTNAME2", 41: "I-LASTNAME3", 42: "I-PASS", 43: "I-PASSPORT", 44: "I-POSTCODE", 45: "I-SECADDRESS", 46: "I-SEX", 47: "I-SOCIALNUMBER", 48: "I-STATE", 49: "I-STREET", 50: "I-TEL", 51: "I-TIME", 52: "I-TITLE", 53: "I-USERNAME", 54: "O" } # Decode predictions tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]) labels = [id2label[pred] for pred in predictions[0]] for token, label in zip(tokens, labels): if token not in ["[CLS]", "[SEP]", "[PAD]"]: print(f"{token}: {label}") ``` ### JavaScript/Node.js Example ```javascript const ort = require('onnxruntime-node'); const { AutoTokenizer } = require('@xenova/transformers'); async function detectPII(text) { // Load tokenizer const tokenizer = await AutoTokenizer.from_pretrained('path/to/model'); // Load ONNX model const session = await ort.InferenceSession.create('onnx/model.onnx'); // Tokenize input const inputs = await tokenizer(text, { padding: true, truncation: true, return_tensors: 'ortvalue' }); // Run inference const outputs = await session.run(inputs); // Process outputs const logits = outputs.logits; // ... process predictions } ``` ## Model Architecture - **Architecture:** BertForTokenClassification - **Hidden Size:** 768 - **Intermediate Size:** 3072 - **Attention Heads:** 12 (typical for BERT-base) - **Hidden Layers:** 12 (typical for BERT-base) - **Activation Function:** GELU - **Max Sequence Length:** 512 tokens - **Dropout:** 0.1 - **Number of Labels:** 55 (54 PII labels + Outside) ## Training Details ### Training Data The model was fine-tuned on the **ai4privacy/pii-masking-300k** dataset: - **Dataset:** [ai4privacy/pii-masking-300k](https://huggingface.co/datasets/ai4privacy/pii-masking-300k) - **Size:** 300,000 examples - **Format:** Pre-annotated text with BIO labels for PII entities - **License:** Check dataset page for license details ### Training Procedure - **Base Model:** `bert-base-uncased` (Google BERT) - **Tokenization:** WordPiece tokenization with lowercase normalization - **Max Sequence Length:** 128 tokens (optimized for efficiency) - **Padding Token:** [PAD] (ID: 0) - **Unknown Token:** [UNK] (ID: 100) - **CLS Token:** [CLS] (ID: 101) - **SEP Token:** [SEP] (ID: 102) - **Mask Token:** [MASK] (ID: 103) ### Training Hyperparameters - **Learning Rate:** 2e-5 - **Batch Size:** 16 (per device) - **Number of Epochs:** 3 - **Weight Decay:** 0.01 - **Optimizer:** AdamW (default) - **Training Platform:** Kaggle with GPU T4 x2 - **Training Time:** ~1-2 hours ### Evaluation Strategy - **Evaluation Metric:** SeqEval (standard for NER tasks) - **Evaluation Strategy:** Every epoch - **Metrics Tracked:** - Precision - Recall - F1 Score - Accuracy ## Evaluation The model should be evaluated on appropriate PII detection benchmarks using standard NER metrics (F1, Precision, Recall) for each entity type. ## Limitations and Bias - The model's performance may vary across different text domains and writing styles - May not generalize well to PII formats from countries/regions not well-represented in training data - Context-dependent entities (e.g., names that are also common words) may be challenging - The model may have biases present in the training data - Should not be used as the sole method for PII detection in critical applications without human review ## Ethical Considerations This model is designed to help protect privacy by detecting PII in text. However: - The model is not perfect and may miss some PII (false negatives) or incorrectly flag non-PII (false positives) - Should be used as part of a comprehensive privacy protection strategy - Users should be aware of applicable privacy regulations (GDPR, CCPA, etc.) - The model's use should comply with all relevant laws and regulations - Consider the implications of automated PII detection in your specific use case ## ONNX Runtime Compatibility This model is compatible with ONNX Runtime and can be deployed on: - CPU (optimized for inference) - GPU (CUDA) - Edge devices - Web browsers (via ONNX.js) - Mobile devices (iOS/Android) ## File Structure ``` . ├── README.md # This file ├── config.json # Model configuration ├── tokenizer_config.json # Tokenizer configuration ├── tokenizer.json # Fast tokenizer ├── vocab.txt # Vocabulary file ├── special_tokens_map.json # Special tokens mapping └── onnx/ └── model.onnx # ONNX model file ``` ## Citation If you use this model in your research or application, please cite: ```bibtex @misc{bert-pii-onnx, title={BERT PII Detection Model (ONNX)}, author={Your Name/Organization}, year={2025}, howpublished={\url{https://huggingface.co/your-username/bert-pii-onnx}} } ``` ### Base Model Citation This model is based on BERT. Please also cite the original BERT paper: ```bibtex @article{devlin2018bert, title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding}, author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina}, journal={arXiv preprint arXiv:1810.04805}, year={2018} } ``` ## Contact For questions, issues, or feedback about this model, please open an issue in the model repository. ## Acknowledgments ### Base Model This model is built upon **BERT (Bidirectional Encoder Representations from Transformers)** developed by Google Research: - Original BERT paper: [Devlin et al., 2018](https://arxiv.org/abs/1810.04805) - BERT is licensed under Apache 2.0 ### Dataset The model was trained on **ai4privacy/pii-masking-300k**: - Dataset: [ai4privacy/pii-masking-300k](https://huggingface.co/datasets/ai4privacy/pii-masking-300k) - Creator: ai4privacy team on Hugging Face - Size: 300,000 examples with PII annotations - Please cite the dataset creators if you use this model ```bibtex @misc{ai4privacy-pii-dataset, title={PII Masking 300K Dataset}, author={ai4privacy}, year={2024}, howpublished={\url{https://huggingface.co/datasets/ai4privacy/pii-masking-300k}} } ``` ### Technologies - **Transformers Library**: [Hugging Face](https://github.com/huggingface/transformers) - **ONNX**: [Open Neural Network Exchange](https://onnx.ai/) for cross-platform model deployment - **ONNX Runtime**: [Microsoft ONNX Runtime](https://onnxruntime.ai/) for efficient inference ### Special Thanks - Hugging Face team for the Transformers library and model hub infrastructure - ONNX community for standardized model format and runtime - Contributors to the training dataset (if applicable)