BERT PII Detection Model (ONNX)

This model is a BERT-based token classification model fine-tuned for detecting Personally Identifiable Information (PII) in text. The model is provided in ONNX format for efficient inference across different platforms.

Model Description

  • Model Type: Token Classification (Named Entity Recognition)
  • Base Model: bert-base-uncased (Google BERT)
  • Format: ONNX
  • Language: English
  • License: Apache 2.0
  • Training Dataset: ai4privacy/pii-masking-300k

Intended Use

This model is designed to identify and classify various types of personally identifiable information in text; the supported categories are listed below.

Supported PII Categories

The model can detect 27 different types of PII entities:

Personal Identifiers

  • GIVENNAME1, GIVENNAME2 - First/given names
  • LASTNAME1, LASTNAME2, LASTNAME3 - Last/family names
  • USERNAME - Usernames
  • TITLE - Personal titles
  • SEX - Gender information

Contact Information

  • EMAIL - Email addresses
  • TEL - Telephone numbers
  • IP - IP addresses

Location Information

  • STREET - Street addresses
  • CITY - City names
  • STATE - State/province names
  • COUNTRY - Country names
  • POSTCODE - Postal/ZIP codes
  • BUILDING - Building names/numbers
  • SECADDRESS - Secondary addresses
  • GEOCOORD - Geographic coordinates

Identification Documents

  • PASSPORT - Passport numbers
  • IDCARD - ID card numbers
  • DRIVERLICENSE - Driver's license numbers
  • SOCIALNUMBER - Social security numbers
  • PASS - Password information

Temporal Information

  • DATE - Date information
  • TIME - Time information
  • BOD - Birth date

The model uses the BIO (Begin-Inside-Outside) tagging scheme (an illustrative example follows the list below), where:

  • B-[ENTITY] marks the beginning of an entity
  • I-[ENTITY] marks the continuation of an entity
  • O marks tokens that are not PII
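
As an illustration, here is a hypothetical tagging of the sentence "My name is John Smith" (ignoring WordPiece sub-tokens; the actual labels depend on the model's predictions):

# Hypothetical (token, label) pairs; real output depends on the model and tokenization
tagged = [
    ("My", "O"), ("name", "O"), ("is", "O"),
    ("John", "B-GIVENNAME1"), ("Smith", "B-LASTNAME1"),
]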

Usage

Requirements

pip install onnxruntime transformers tokenizers
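
If a CUDA-capable GPU is available, the GPU build of ONNX Runtime can be installed instead (onnxruntime and onnxruntime-gpu are alternative packages; install only one of them):

pip install onnxruntime-gpu transformers tokenizers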

Python Example

import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("path/to/model")

# Load ONNX model
session = ort.InferenceSession("onnx/model.onnx")

# Prepare input text
text = "My name is John Smith and my email is [email protected]"
inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True)

# Run inference
outputs = session.run(
    None,
    {
        "input_ids": inputs["input_ids"].astype(np.int64),
        "attention_mask": inputs["attention_mask"].astype(np.int64),
        "token_type_ids": inputs["token_type_ids"].astype(np.int64)
    }
)

# Get predictions
logits = outputs[0]
predictions = np.argmax(logits, axis=-1)

# Map predictions to labels
id2label = {
    0: "B-BOD", 1: "B-BUILDING", 2: "B-CITY", 3: "B-COUNTRY",
    4: "B-DATE", 5: "B-DRIVERLICENSE", 6: "B-EMAIL", 7: "B-GEOCOORD",
    8: "B-GIVENNAME1", 9: "B-GIVENNAME2", 10: "B-IDCARD", 11: "B-IP",
    12: "B-LASTNAME1", 13: "B-LASTNAME2", 14: "B-LASTNAME3", 15: "B-PASS",
    16: "B-PASSPORT", 17: "B-POSTCODE", 18: "B-SECADDRESS", 19: "B-SEX",
    20: "B-SOCIALNUMBER", 21: "B-STATE", 22: "B-STREET", 23: "B-TEL",
    24: "B-TIME", 25: "B-TITLE", 26: "B-USERNAME", 27: "I-BOD",
    28: "I-BUILDING", 29: "I-CITY", 30: "I-COUNTRY", 31: "I-DATE",
    32: "I-DRIVERLICENSE", 33: "I-EMAIL", 34: "I-GEOCOORD", 35: "I-GIVENNAME1",
    36: "I-GIVENNAME2", 37: "I-IDCARD", 38: "I-IP", 39: "I-LASTNAME1",
    40: "I-LASTNAME2", 41: "I-LASTNAME3", 42: "I-PASS", 43: "I-PASSPORT",
    44: "I-POSTCODE", 45: "I-SECADDRESS", 46: "I-SEX", 47: "I-SOCIALNUMBER",
    48: "I-STATE", 49: "I-STREET", 50: "I-TEL", 51: "I-TIME",
    52: "I-TITLE", 53: "I-USERNAME", 54: "O"
}

# Decode predictions
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [id2label[pred] for pred in predictions[0]]

for token, label in zip(tokens, labels):
    if token not in ["[CLS]", "[SEP]", "[PAD]"]:
        print(f"{token}: {label}")

JavaScript/Node.js Example

const ort = require('onnxruntime-node');
const { AutoTokenizer } = require('@xenova/transformers');

async function detectPII(text) {
    // Load tokenizer
    const tokenizer = await AutoTokenizer.from_pretrained('path/to/model');
    
    // Load ONNX model
    const session = await ort.InferenceSession.create('onnx/model.onnx');
    
    // Tokenize input (Transformers.js returns tensors with BigInt64Array data)
    const encoded = await tokenizer(text, { padding: true, truncation: true });
    
    // Re-wrap the tokenizer output as onnxruntime-node tensors
    // (if the tokenizer output has no token_type_ids, supply a zero-filled
    //  int64 tensor of the same shape instead)
    const feeds = {
        input_ids: new ort.Tensor('int64', encoded.input_ids.data, encoded.input_ids.dims),
        attention_mask: new ort.Tensor('int64', encoded.attention_mask.data, encoded.attention_mask.dims),
        token_type_ids: new ort.Tensor('int64', encoded.token_type_ids.data, encoded.token_type_ids.dims),
    };
    
    // Run inference
    const outputs = await session.run(feeds);
    
    // Process outputs: logits has shape [batch, sequence_length, 55]
    const logits = outputs.logits;
    // ... take the argmax over the last dimension for each token and map ids to labels
}

Model Architecture

  • Architecture: BertForTokenClassification
  • Hidden Size: 768
  • Intermediate Size: 3072
  • Attention Heads: 12 (typical for BERT-base)
  • Hidden Layers: 12 (typical for BERT-base)
  • Activation Function: GELU
  • Max Sequence Length: 512 tokens
  • Dropout: 0.1
  • Number of Labels: 55 (27 entity types with B-/I- tags, plus O; see the inspection sketch below)
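
A quick way to sanity-check these dimensions against the exported graph is to inspect the ONNX session's inputs and outputs. The names and shapes shown in the comments are what a standard BertForTokenClassification export typically produces, so treat them as expected rather than guaranteed:

import onnxruntime as ort

session = ort.InferenceSession("onnx/model.onnx")

# Typically: input_ids, attention_mask, token_type_ids with shape [batch, sequence]
for inp in session.get_inputs():
    print("input ", inp.name, inp.shape, inp.type)

# Typically a single "logits" output with shape [batch, sequence, 55]
for out in session.get_outputs():
    print("output", out.name, out.shape, out.type)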

Training Details

Training Data

The model was fine-tuned on the ai4privacy/pii-masking-300k dataset:

  • Dataset: ai4privacy/pii-masking-300k
  • Size: 300,000 examples
  • Format: Pre-annotated text with BIO labels for PII entities
  • License: Check dataset page for license details

Training Procedure

  • Base Model: bert-base-uncased (Google BERT)
  • Tokenization: WordPiece tokenization with lowercase normalization
  • Max Sequence Length during fine-tuning: 128 tokens (chosen for efficiency; the architecture itself supports up to 512)
  • Padding Token: [PAD] (ID: 0)
  • Unknown Token: [UNK] (ID: 100)
  • CLS Token: [CLS] (ID: 101)
  • SEP Token: [SEP] (ID: 102)
  • Mask Token: [MASK] (ID: 103)

Training Hyperparameters

  • Learning Rate: 2e-5
  • Batch Size: 16 (per device)
  • Number of Epochs: 3
  • Weight Decay: 0.01
  • Optimizer: AdamW (default)
  • Training Platform: Kaggle (2x NVIDIA T4 GPUs)
  • Training Time: ~1-2 hours (a Trainer configuration sketch follows this list)
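
A minimal sketch of how these hyperparameters map onto the Hugging Face Trainer API; dataset preparation and the Trainer call itself are omitted, and output_dir is a placeholder:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-pii-finetune",   # placeholder path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",            # named evaluation_strategy in older transformers releases
)
# The default optimizer is AdamW, matching the settings above.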

Evaluation Strategy

  • Evaluation Metric: seqeval (standard for NER tasks; a compute_metrics sketch follows this list)
  • Evaluation Strategy: Every epoch
  • Metrics Tracked:
    • Precision
    • Recall
    • F1 Score
    • Accuracy
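
A minimal compute_metrics sketch using the seqeval package. It assumes the conventional Trainer setup for token classification, where -100 marks ignored positions (special and sub-word tokens), and reuses the id2label mapping from the Python example above:

import numpy as np
from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # Keep only positions with a real label (-100 marks ignored tokens)
    true_labels = [
        [id2label[l] for l in label_row if l != -100]
        for label_row in labels
    ]
    true_preds = [
        [id2label[p] for p, l in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(predictions, labels)
    ]

    return {
        "precision": precision_score(true_labels, true_preds),
        "recall": recall_score(true_labels, true_preds),
        "f1": f1_score(true_labels, true_preds),
        "accuracy": accuracy_score(true_labels, true_preds),
    }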

Evaluation

The model should be evaluated on appropriate PII detection benchmarks using standard NER metrics (F1, Precision, Recall) for each entity type.

Limitations and Bias

  • The model's performance may vary across different text domains and writing styles
  • May not generalize well to PII formats from countries/regions not well-represented in training data
  • Context-dependent entities (e.g., names that are also common words) may be challenging
  • The model may have biases present in the training data
  • Should not be used as the sole method for PII detection in critical applications without human review

Ethical Considerations

This model is designed to help protect privacy by detecting PII in text. However:

  • The model is not perfect and may miss some PII (false negatives) or incorrectly flag non-PII (false positives)
  • Should be used as part of a comprehensive privacy protection strategy
  • Users should be aware of applicable privacy regulations (GDPR, CCPA, etc.)
  • The model's use should comply with all relevant laws and regulations
  • Consider the implications of automated PII detection in your specific use case

ONNX Runtime Compatibility

This model is compatible with ONNX Runtime and can be deployed on the following targets (a provider-selection sketch follows the list):

  • CPU (optimized for inference)
  • GPU (CUDA)
  • Edge devices
  • Web browsers (via ONNX Runtime Web)
  • Mobile devices (iOS/Android)
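
A minimal sketch of selecting an execution provider with the Python API (CUDA requires the onnxruntime-gpu package; ONNX Runtime falls back to the CPU provider if CUDA is unavailable):

import onnxruntime as ort

# Prefer CUDA when available, otherwise fall back to the CPU provider
session = ort.InferenceSession(
    "onnx/model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # providers actually in use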

File Structure

.
β”œβ”€β”€ README.md                    # This file
β”œβ”€β”€ config.json                  # Model configuration
β”œβ”€β”€ tokenizer_config.json        # Tokenizer configuration
β”œβ”€β”€ tokenizer.json              # Fast tokenizer
β”œβ”€β”€ vocab.txt                   # Vocabulary file
β”œβ”€β”€ special_tokens_map.json     # Special tokens mapping
└── onnx/
    └── model.onnx              # ONNX model file

Citation

If you use this model in your research or application, please cite:

@misc{bert-pii-onnx,
  title={BERT PII Detection Model (ONNX)},
  author={Your Name/Organization},
  year={2025},
  howpublished={\url{https://huggingface.co/your-username/bert-pii-onnx}}
}

Base Model Citation

This model is based on BERT. Please also cite the original BERT paper:

@article{devlin2018bert,
  title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding},
  author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1810.04805},
  year={2018}
}

Contact

For questions, issues, or feedback about this model, please open an issue in the model repository.

Acknowledgments

Base Model

This model is built upon BERT (Bidirectional Encoder Representations from Transformers), developed by Google Research; see the base model citation above.

Dataset

The model was trained on ai4privacy/pii-masking-300k:

  • Dataset: ai4privacy/pii-masking-300k
  • Creator: ai4privacy team on Hugging Face
  • Size: 300,000 examples with PII annotations
  • Please cite the dataset creators if you use this model:

@misc{ai4privacy-pii-dataset,
  title={PII Masking 300K Dataset},
  author={ai4privacy},
  year={2024},
  howpublished={\url{https://huggingface.co/datasets/ai4privacy/pii-masking-300k}}
}

Special Thanks

  • Hugging Face team for the Transformers library and model hub infrastructure
  • ONNX community for standardized model format and runtime
  • The ai4privacy team and other contributors to the training dataset