---
language: en
license: apache-2.0
tags:
- bert
- token-classification
- ner
- pii
- privacy
- onnx
- personal-information
datasets:
- custom
metrics:
- f1
- precision
- recall
model-index:
- name: bert-pii-onnx
  results: []
---

# BERT PII Detection Model (ONNX)

This model is a BERT-based token classification model fine-tuned for detecting Personally Identifiable Information (PII) in text. The model is provided in ONNX format for efficient inference across different platforms.

## Model Description

- **Model Type:** Token Classification (Named Entity Recognition)
- **Base Model:** `bert-base-uncased` (Google BERT)
- **Format:** ONNX
- **Language:** English
- **License:** Apache 2.0
- **Training Dataset:** ai4privacy/pii-masking-300k

## Intended Use

This model is designed to identify and classify various types of personally identifiable information in text, including but not limited to:

### Supported PII Categories

The model can detect 27 different types of PII entities:

#### Personal Identifiers
- **GIVENNAME1, GIVENNAME2** - First/given names
- **LASTNAME1, LASTNAME2, LASTNAME3** - Last/family names
- **USERNAME** - Usernames
- **TITLE** - Personal titles
- **SEX** - Gender information

#### Contact Information
- **EMAIL** - Email addresses
- **TEL** - Telephone numbers
- **IP** - IP addresses

#### Location Information
- **STREET** - Street addresses
- **CITY** - City names
- **STATE** - State/province names
- **COUNTRY** - Country names
- **POSTCODE** - Postal/ZIP codes
- **BUILDING** - Building names/numbers
- **SECADDRESS** - Secondary addresses
- **GEOCOORD** - Geographic coordinates

#### Identification Documents
- **PASSPORT** - Passport numbers
- **IDCARD** - ID card numbers
- **DRIVERLICENSE** - Driver's license numbers
- **SOCIALNUMBER** - Social security numbers
- **PASS** - Password information

#### Temporal Information
- **DATE** - Date information
- **TIME** - Time information
- **BOD** - Birth date

The model uses BIO (Begin-Inside-Outside) tagging scheme, where:
- `B-[ENTITY]` marks the beginning of an entity
- `I-[ENTITY]` marks the continuation of an entity
- `O` marks tokens that are not PII

## Usage

### Requirements

```bash
pip install onnxruntime transformers tokenizers
```

### Python Example

```python
import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("path/to/model")

# Load ONNX model
session = ort.InferenceSession("onnx/model.onnx")

# Prepare input text
text = "My name is John Smith and my email is john.smith@example.com"
inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True)

# Run inference
outputs = session.run(
    None,
    {
        "input_ids": inputs["input_ids"].astype(np.int64),
        "attention_mask": inputs["attention_mask"].astype(np.int64),
        "token_type_ids": inputs["token_type_ids"].astype(np.int64)
    }
)

# Get predictions
logits = outputs[0]
predictions = np.argmax(logits, axis=-1)

# Map predictions to labels
id2label = {
    0: "B-BOD", 1: "B-BUILDING", 2: "B-CITY", 3: "B-COUNTRY",
    4: "B-DATE", 5: "B-DRIVERLICENSE", 6: "B-EMAIL", 7: "B-GEOCOORD",
    8: "B-GIVENNAME1", 9: "B-GIVENNAME2", 10: "B-IDCARD", 11: "B-IP",
    12: "B-LASTNAME1", 13: "B-LASTNAME2", 14: "B-LASTNAME3", 15: "B-PASS",
    16: "B-PASSPORT", 17: "B-POSTCODE", 18: "B-SECADDRESS", 19: "B-SEX",
    20: "B-SOCIALNUMBER", 21: "B-STATE", 22: "B-STREET", 23: "B-TEL",
    24: "B-TIME", 25: "B-TITLE", 26: "B-USERNAME", 27: "I-BOD",
    28: "I-BUILDING", 29: "I-CITY", 30: "I-COUNTRY", 31: "I-DATE",
    32: "I-DRIVERLICENSE", 33: "I-EMAIL", 34: "I-GEOCOORD", 35: "I-GIVENNAME1",
    36: "I-GIVENNAME2", 37: "I-IDCARD", 38: "I-IP", 39: "I-LASTNAME1",
    40: "I-LASTNAME2", 41: "I-LASTNAME3", 42: "I-PASS", 43: "I-PASSPORT",
    44: "I-POSTCODE", 45: "I-SECADDRESS", 46: "I-SEX", 47: "I-SOCIALNUMBER",
    48: "I-STATE", 49: "I-STREET", 50: "I-TEL", 51: "I-TIME",
    52: "I-TITLE", 53: "I-USERNAME", 54: "O"
}

# Decode predictions
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [id2label[pred] for pred in predictions[0]]

for token, label in zip(tokens, labels):
    if token not in ["[CLS]", "[SEP]", "[PAD]"]:
        print(f"{token}: {label}")
```

### JavaScript/Node.js Example

```javascript
const ort = require('onnxruntime-node');
const { AutoTokenizer } = require('@xenova/transformers');

async function detectPII(text) {
    // Load tokenizer
    const tokenizer = await AutoTokenizer.from_pretrained('path/to/model');
    
    // Load ONNX model
    const session = await ort.InferenceSession.create('onnx/model.onnx');
    
    // Tokenize input
    const inputs = await tokenizer(text, { 
        padding: true, 
        truncation: true, 
        return_tensors: 'ortvalue' 
    });
    
    // Run inference
    const outputs = await session.run(inputs);
    
    // Process outputs
    const logits = outputs.logits;
    // ... process predictions
}
```

## Model Architecture

- **Architecture:** BertForTokenClassification
- **Hidden Size:** 768
- **Intermediate Size:** 3072
- **Attention Heads:** 12 (typical for BERT-base)
- **Hidden Layers:** 12 (typical for BERT-base)
- **Activation Function:** GELU
- **Max Sequence Length:** 512 tokens
- **Dropout:** 0.1
- **Number of Labels:** 55 (54 PII labels + Outside)

## Training Details

### Training Data

The model was fine-tuned on the **ai4privacy/pii-masking-300k** dataset:
- **Dataset:** [ai4privacy/pii-masking-300k](https://huggingface.co/datasets/ai4privacy/pii-masking-300k)
- **Size:** 300,000 examples
- **Format:** Pre-annotated text with BIO labels for PII entities
- **License:** Check dataset page for license details

### Training Procedure

- **Base Model:** `bert-base-uncased` (Google BERT)
- **Tokenization:** WordPiece tokenization with lowercase normalization
- **Max Sequence Length:** 128 tokens (optimized for efficiency)
- **Padding Token:** [PAD] (ID: 0)
- **Unknown Token:** [UNK] (ID: 100)
- **CLS Token:** [CLS] (ID: 101)
- **SEP Token:** [SEP] (ID: 102)
- **Mask Token:** [MASK] (ID: 103)

### Training Hyperparameters

- **Learning Rate:** 2e-5
- **Batch Size:** 16 (per device)
- **Number of Epochs:** 3
- **Weight Decay:** 0.01
- **Optimizer:** AdamW (default)
- **Training Platform:** Kaggle with GPU T4 x2
- **Training Time:** ~1-2 hours

### Evaluation Strategy

- **Evaluation Metric:** SeqEval (standard for NER tasks)
- **Evaluation Strategy:** Every epoch
- **Metrics Tracked:**
  - Precision
  - Recall
  - F1 Score
  - Accuracy

## Evaluation

The model should be evaluated on appropriate PII detection benchmarks using standard NER metrics (F1, Precision, Recall) for each entity type.

## Limitations and Bias

- The model's performance may vary across different text domains and writing styles
- May not generalize well to PII formats from countries/regions not well-represented in training data
- Context-dependent entities (e.g., names that are also common words) may be challenging
- The model may have biases present in the training data
- Should not be used as the sole method for PII detection in critical applications without human review

## Ethical Considerations

This model is designed to help protect privacy by detecting PII in text. However:

- The model is not perfect and may miss some PII (false negatives) or incorrectly flag non-PII (false positives)
- Should be used as part of a comprehensive privacy protection strategy
- Users should be aware of applicable privacy regulations (GDPR, CCPA, etc.)
- The model's use should comply with all relevant laws and regulations
- Consider the implications of automated PII detection in your specific use case

## ONNX Runtime Compatibility

This model is compatible with ONNX Runtime and can be deployed on:
- CPU (optimized for inference)
- GPU (CUDA)
- Edge devices
- Web browsers (via ONNX.js)
- Mobile devices (iOS/Android)

## File Structure

```
.
├── README.md                    # This file
├── config.json                  # Model configuration
├── tokenizer_config.json        # Tokenizer configuration
├── tokenizer.json              # Fast tokenizer
├── vocab.txt                   # Vocabulary file
├── special_tokens_map.json     # Special tokens mapping
└── onnx/
    └── model.onnx              # ONNX model file
```

## Citation

If you use this model in your research or application, please cite:

```bibtex
@misc{bert-pii-onnx,
  title={BERT PII Detection Model (ONNX)},
  author={Your Name/Organization},
  year={2025},
  howpublished={\url{https://huggingface.co/your-username/bert-pii-onnx}}
}
```

### Base Model Citation

This model is based on BERT. Please also cite the original BERT paper:

```bibtex
@article{devlin2018bert,
  title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding},
  author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1810.04805},
  year={2018}
}
```

## Contact

For questions, issues, or feedback about this model, please open an issue in the model repository.

## Acknowledgments

### Base Model
This model is built upon **BERT (Bidirectional Encoder Representations from Transformers)** developed by Google Research:
- Original BERT paper: [Devlin et al., 2018](https://arxiv.org/abs/1810.04805)
- BERT is licensed under Apache 2.0

### Dataset
The model was trained on **ai4privacy/pii-masking-300k**:
- Dataset: [ai4privacy/pii-masking-300k](https://huggingface.co/datasets/ai4privacy/pii-masking-300k)
- Creator: ai4privacy team on Hugging Face
- Size: 300,000 examples with PII annotations
- Please cite the dataset creators if you use this model

```bibtex
@misc{ai4privacy-pii-dataset,
  title={PII Masking 300K Dataset},
  author={ai4privacy},
  year={2024},
  howpublished={\url{https://huggingface.co/datasets/ai4privacy/pii-masking-300k}}
}
```

### Technologies
- **Transformers Library**: [Hugging Face](https://github.com/huggingface/transformers)
- **ONNX**: [Open Neural Network Exchange](https://onnx.ai/) for cross-platform model deployment
- **ONNX Runtime**: [Microsoft ONNX Runtime](https://onnxruntime.ai/) for efficient inference

### Special Thanks
- Hugging Face team for the Transformers library and model hub infrastructure
- ONNX community for standardized model format and runtime
- Contributors to the training dataset (if applicable)