BERT PII Detection Model (ONNX)

This model is a BERT-based token classification model fine-tuned for detecting Personally Identifiable Information (PII) in text. The model is provided in ONNX format for efficient inference across different platforms.

Model Description

  • Model Type: Token Classification (Named Entity Recognition)
  • Base Model: bert-base-uncased (Google BERT)
  • Format: ONNX
  • Language: English
  • License: Apache 2.0
  • Training Dataset: ai4privacy/pii-masking-300k

Intended Use

This model is designed to identify and classify various types of personally identifiable information in text; the supported categories are listed below.

Supported PII Categories

The model can detect 27 different types of PII entities:

Personal Identifiers

  • GIVENNAME1, GIVENNAME2 - First/given names
  • LASTNAME1, LASTNAME2, LASTNAME3 - Last/family names
  • USERNAME - Usernames
  • TITLE - Personal titles
  • SEX - Gender information

Contact Information

  • EMAIL - Email addresses
  • TEL - Telephone numbers
  • IP - IP addresses

Location Information

  • STREET - Street addresses
  • CITY - City names
  • STATE - State/province names
  • COUNTRY - Country names
  • POSTCODE - Postal/ZIP codes
  • BUILDING - Building names/numbers
  • SECADDRESS - Secondary addresses
  • GEOCOORD - Geographic coordinates

Identification Documents

  • PASSPORT - Passport numbers
  • IDCARD - ID card numbers
  • DRIVERLICENSE - Driver's license numbers
  • SOCIALNUMBER - Social security numbers
  • PASS - Password information

Temporal Information

  • DATE - Date information
  • TIME - Time information
  • BOD - Birth date

The model uses the BIO (Begin-Inside-Outside) tagging scheme (an illustrative example follows the list below), where:

  • B-[ENTITY] marks the beginning of an entity
  • I-[ENTITY] marks the continuation of an entity
  • O marks tokens that are not PII
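
As an illustration, here is a hypothetical tagging of the sentence "My name is John Smith" (ignoring WordPiece sub-tokens; the actual labels depend on the model's predictions):

# Hypothetical (token, label) pairs; real output depends on the model and tokenization
tagged = [
    ("My", "O"), ("name", "O"), ("is", "O"),
    ("John", "B-GIVENNAME1"), ("Smith", "B-LASTNAME1"),
]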

Usage

Requirements

pip install onnxruntime transformers tokenizers
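
If a CUDA-capable GPU is available, the GPU build of ONNX Runtime can be installed instead (onnxruntime and onnxruntime-gpu are alternative packages; install only one of them):

pip install onnxruntime-gpu transformers tokenizers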

Python Example

import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("path/to/model")

# Load ONNX model
session = ort.InferenceSession("onnx/model.onnx")

# Prepare input text
text = "My name is John Smith and my email is [email protected]"
inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True)

# Run inference
outputs = session.run(
    None,
    {
        "input_ids": inputs["input_ids"].astype(np.int64),
        "attention_mask": inputs["attention_mask"].astype(np.int64),
        "token_type_ids": inputs["token_type_ids"].astype(np.int64)
    }
)

# Get predictions
logits = outputs[0]
predictions = np.argmax(logits, axis=-1)

# Map predictions to labels
id2label = {
    0: "B-BOD", 1: "B-BUILDING", 2: "B-CITY", 3: "B-COUNTRY",
    4: "B-DATE", 5: "B-DRIVERLICENSE", 6: "B-EMAIL", 7: "B-GEOCOORD",
    8: "B-GIVENNAME1", 9: "B-GIVENNAME2", 10: "B-IDCARD", 11: "B-IP",
    12: "B-LASTNAME1", 13: "B-LASTNAME2", 14: "B-LASTNAME3", 15: "B-PASS",
    16: "B-PASSPORT", 17: "B-POSTCODE", 18: "B-SECADDRESS", 19: "B-SEX",
    20: "B-SOCIALNUMBER", 21: "B-STATE", 22: "B-STREET", 23: "B-TEL",
    24: "B-TIME", 25: "B-TITLE", 26: "B-USERNAME", 27: "I-BOD",
    28: "I-BUILDING", 29: "I-CITY", 30: "I-COUNTRY", 31: "I-DATE",
    32: "I-DRIVERLICENSE", 33: "I-EMAIL", 34: "I-GEOCOORD", 35: "I-GIVENNAME1",
    36: "I-GIVENNAME2", 37: "I-IDCARD", 38: "I-IP", 39: "I-LASTNAME1",
    40: "I-LASTNAME2", 41: "I-LASTNAME3", 42: "I-PASS", 43: "I-PASSPORT",
    44: "I-POSTCODE", 45: "I-SECADDRESS", 46: "I-SEX", 47: "I-SOCIALNUMBER",
    48: "I-STATE", 49: "I-STREET", 50: "I-TEL", 51: "I-TIME",
    52: "I-TITLE", 53: "I-USERNAME", 54: "O"
}

# Decode predictions
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [id2label[pred] for pred in predictions[0]]

for token, label in zip(tokens, labels):
    if token not in ["[CLS]", "[SEP]", "[PAD]"]:
        print(f"{token}: {label}")

JavaScript/Node.js Example

const ort = require('onnxruntime-node');
const { AutoTokenizer } = require('@xenova/transformers');

async function detectPII(text) {
    // Load tokenizer
    const tokenizer = await AutoTokenizer.from_pretrained('path/to/model');
    
    // Load ONNX model
    const session = await ort.InferenceSession.create('onnx/model.onnx');
    
    // Tokenize input (Transformers.js returns tensors with BigInt64Array data)
    const encoded = await tokenizer(text, { padding: true, truncation: true });
    
    // Re-wrap the tokenizer output as onnxruntime-node tensors
    // (if the tokenizer output has no token_type_ids, supply a zero-filled
    //  int64 tensor of the same shape instead)
    const feeds = {
        input_ids: new ort.Tensor('int64', encoded.input_ids.data, encoded.input_ids.dims),
        attention_mask: new ort.Tensor('int64', encoded.attention_mask.data, encoded.attention_mask.dims),
        token_type_ids: new ort.Tensor('int64', encoded.token_type_ids.data, encoded.token_type_ids.dims),
    };
    
    // Run inference
    const outputs = await session.run(feeds);
    
    // Process outputs: logits has shape [batch, sequence_length, 55]
    const logits = outputs.logits;
    // ... take the argmax over the last dimension for each token and map ids to labels
}

Model Architecture

  • Architecture: BertForTokenClassification
  • Hidden Size: 768
  • Intermediate Size: 3072
  • Attention Heads: 12 (typical for BERT-base)
  • Hidden Layers: 12 (typical for BERT-base)
  • Activation Function: GELU
  • Max Sequence Length: 512 tokens
  • Dropout: 0.1
  • Number of Labels: 55 (27 entity types with B-/I- tags, plus O; see the inspection sketch below)
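
A quick way to sanity-check these dimensions against the exported graph is to inspect the ONNX session's inputs and outputs. The names and shapes shown in the comments are what a standard BertForTokenClassification export typically produces, so treat them as expected rather than guaranteed:

import onnxruntime as ort

session = ort.InferenceSession("onnx/model.onnx")

# Typically: input_ids, attention_mask, token_type_ids with shape [batch, sequence]
for inp in session.get_inputs():
    print("input ", inp.name, inp.shape, inp.type)

# Typically a single "logits" output with shape [batch, sequence, 55]
for out in session.get_outputs():
    print("output", out.name, out.shape, out.type)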

Training Details

Training Data

The model was fine-tuned on the ai4privacy/pii-masking-300k dataset:

  • Dataset: ai4privacy/pii-masking-300k
  • Size: 300,000 examples
  • Format: Pre-annotated text with BIO labels for PII entities
  • License: Check dataset page for license details

Training Procedure

  • Base Model: bert-base-uncased (Google BERT)
  • Tokenization: WordPiece tokenization with lowercase normalization
  • Max Sequence Length during fine-tuning: 128 tokens (chosen for efficiency; the architecture itself supports up to 512)
  • Padding Token: [PAD] (ID: 0)
  • Unknown Token: [UNK] (ID: 100)
  • CLS Token: [CLS] (ID: 101)
  • SEP Token: [SEP] (ID: 102)
  • Mask Token: [MASK] (ID: 103)

Training Hyperparameters

  • Learning Rate: 2e-5
  • Batch Size: 16 (per device)
  • Number of Epochs: 3
  • Weight Decay: 0.01
  • Optimizer: AdamW (default)
  • Training Platform: Kaggle (2x NVIDIA T4 GPUs)
  • Training Time: ~1-2 hours (a Trainer configuration sketch follows this list)
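
A minimal sketch of how these hyperparameters map onto the Hugging Face Trainer API; dataset preparation and the Trainer call itself are omitted, and output_dir is a placeholder:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-pii-finetune",   # placeholder path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",            # named evaluation_strategy in older transformers releases
)
# The default optimizer is AdamW, matching the settings above.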

Evaluation Strategy

  • Evaluation Metric: seqeval (standard for NER tasks; a compute_metrics sketch follows this list)
  • Evaluation Strategy: Every epoch
  • Metrics Tracked:
    • Precision
    • Recall
    • F1 Score
    • Accuracy
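
A minimal compute_metrics sketch using the seqeval package. It assumes the conventional Trainer setup for token classification, where -100 marks ignored positions (special and sub-word tokens), and reuses the id2label mapping from the Python example above:

import numpy as np
from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # Keep only positions with a real label (-100 marks ignored tokens)
    true_labels = [
        [id2label[l] for l in label_row if l != -100]
        for label_row in labels
    ]
    true_preds = [
        [id2label[p] for p, l in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(predictions, labels)
    ]

    return {
        "precision": precision_score(true_labels, true_preds),
        "recall": recall_score(true_labels, true_preds),
        "f1": f1_score(true_labels, true_preds),
        "accuracy": accuracy_score(true_labels, true_preds),
    }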

Evaluation

The model should be evaluated on appropriate PII detection benchmarks using standard NER metrics (F1, Precision, Recall) for each entity type.

Limitations and Bias

  • The model's performance may vary across different text domains and writing styles
  • May not generalize well to PII formats from countries/regions not well-represented in training data
  • Context-dependent entities (e.g., names that are also common words) may be challenging
  • The model may have biases present in the training data
  • Should not be used as the sole method for PII detection in critical applications without human review

Ethical Considerations

This model is designed to help protect privacy by detecting PII in text. However:

  • The model is not perfect and may miss some PII (false negatives) or incorrectly flag non-PII (false positives)
  • Should be used as part of a comprehensive privacy protection strategy
  • Users should be aware of applicable privacy regulations (GDPR, CCPA, etc.)
  • The model's use should comply with all relevant laws and regulations
  • Consider the implications of automated PII detection in your specific use case

ONNX Runtime Compatibility

This model is compatible with ONNX Runtime and can be deployed on the following targets (a provider-selection sketch follows the list):

  • CPU (optimized for inference)
  • GPU (CUDA)
  • Edge devices
  • Web browsers (via ONNX Runtime Web)
  • Mobile devices (iOS/Android)
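
A minimal sketch of selecting an execution provider with the Python API (CUDA requires the onnxruntime-gpu package; ONNX Runtime falls back to the CPU provider if CUDA is unavailable):

import onnxruntime as ort

# Prefer CUDA when available, otherwise fall back to the CPU provider
session = ort.InferenceSession(
    "onnx/model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # providers actually in use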

File Structure

.
β”œβ”€β”€ README.md                    # This file
β”œβ”€β”€ config.json                  # Model configuration
β”œβ”€β”€ tokenizer_config.json        # Tokenizer configuration
β”œβ”€β”€ tokenizer.json              # Fast tokenizer
β”œβ”€β”€ vocab.txt                   # Vocabulary file
β”œβ”€β”€ special_tokens_map.json     # Special tokens mapping
└── onnx/
    └── model.onnx              # ONNX model file

Citation

If you use this model in your research or application, please cite:

@misc{bert-pii-onnx,
  title={BERT PII Detection Model (ONNX)},
  author={Your Name/Organization},
  year={2025},
  howpublished={\url{https://huggingface.co/your-username/bert-pii-onnx}}
}

Base Model Citation

This model is based on BERT. Please also cite the original BERT paper:

@article{devlin2018bert,
  title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding},
  author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1810.04805},
  year={2018}
}

Contact

For questions, issues, or feedback about this model, please open an issue in the model repository.

Acknowledgments

Base Model

This model is built upon BERT (Bidirectional Encoder Representations from Transformers), developed by Google Research; see the base model citation above.

Dataset

The model was trained on ai4privacy/pii-masking-300k:

  • Dataset: ai4privacy/pii-masking-300k
  • Creator: ai4privacy team on Hugging Face
  • Size: 300,000 examples with PII annotations
  • Please cite the dataset creators if you use this model:

@misc{ai4privacy-pii-dataset,
  title={PII Masking 300K Dataset},
  author={ai4privacy},
  year={2024},
  howpublished={\url{https://huggingface.co/datasets/ai4privacy/pii-masking-300k}}
}

Special Thanks

  • Hugging Face team for the Transformers library and model hub infrastructure
  • ONNX community for standardized model format and runtime
  • The ai4privacy team and other contributors to the training dataset