---
language: en
license: apache-2.0
tags:
- text2text-generation
- ocr
- error-correction
- bart
- historical-text
datasets:
- custom
metrics:
- cer
- wer
model-index:
- name: bart-synthetic-data-vampyre-ocr-correction
  results:
  - task:
      type: text2text-generation
      name: OCR Error Correction
    dataset:
      type: custom
      name: The Vampyre (Synthetic + Real)
    metrics:
    - type: cer
      value: 14.49
      name: Character Error Rate
    - type: wer
      value: 37.99
      name: Word Error Rate
---

# BART-Base OCR Error Correction (Synthetic Data + Real Vampyre Text)

This model is a fine-tuned version of [facebook/bart-base](https://huggingface.co/facebook/bart-base) for correcting OCR errors in historical texts; it was trained specifically on "The Vampyre" dataset.

## Model Description

- **Base Model:** facebook/bart-base
- **Task:** OCR error correction
- **Training Strategy:**
  - Train/Val: synthetic OCR data (1020 samples with GPT-4 generated errors)
  - Test: real OCR data from "The Vampyre" (300 samples)
- **Best Checkpoint:** epoch 2
- **Validation CER:** 14.49%
- **Validation WER:** 37.99%

## Performance

Evaluated on real historical OCR text from "The Vampyre":

| Metric | Score |
|--------|-------|
| **Character Error Rate (CER)** | **14.49%** |
| **Word Error Rate (WER)** | **37.99%** |
| **Exact Match** | 0.0% |
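
CER and WER are standard edit-distance metrics. The evaluation script used for this card is not included in the repository, so the snippet below is only a minimal sketch of how such scores are typically computed with the Hugging Face `evaluate` library (the `cer` metric additionally requires `jiwer`):

```python
# Minimal sketch (not the author's evaluation script): computing CER/WER
# for prediction/reference pairs with the `evaluate` library.
import evaluate

cer_metric = evaluate.load("cer")  # character error rate (needs `jiwer` installed)
wer_metric = evaluate.load("wer")  # word error rate

predictions = ["This is an OCR error", "The ancient trees"]  # model outputs
references = ["This is an OCR error", "The ancient trees"]   # ground-truth text

print("CER:", cer_metric.compute(predictions=predictions, references=references))
print("WER:", wer_metric.compute(predictions=predictions, references=references))
```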

## Usage

### Quick Start

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("ejaz111/bart-synthetic-data-vampyre-ocr-correction")
model = AutoModelForSeq2SeqLM.from_pretrained("ejaz111/bart-synthetic-data-vampyre-ocr-correction")

# Correct OCR errors
ocr_text = "Th1s 1s an 0CR err0r w1th m1stakes in the anc1ent text."
input_ids = tokenizer(ocr_text, return_tensors="pt", max_length=512, truncation=True).input_ids
outputs = model.generate(input_ids, max_length=512)
corrected_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Original: {ocr_text}")
print(f"Corrected: {corrected_text}")
```
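
For correcting many OCR lines at once (e.g., a whole scanned page split into lines), batching the tokenizer call is usually faster. A small sketch, reusing the `model` and `tokenizer` loaded in Quick Start (the example lines are illustrative):

```python
# Sketch: batch-correct several OCR lines at once (reuses model/tokenizer from above).
lines = [
    "Th1s 1s an 0CR err0r",
    "The breeze wh15pered so7tly through the anci3nt tre55",
]
batch = tokenizer(lines, return_tensors="pt", padding=True, truncation=True, max_length=512)
outputs = model.generate(**batch, max_length=512)
for line, out in zip(lines, outputs):
    print(line, "->", tokenizer.decode(out, skip_special_tokens=True))
```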

### Using Pipeline

```python
from transformers import pipeline

corrector = pipeline("text2text-generation", model="ejaz111/bart-synthetic-data-vampyre-ocr-correction")
result = corrector("The breeze wh15pered so7tly through the anci3nt tre55")[0]['generated_text']
print(result)
# Output: "The breeze whispered softly through the ancient trees"
```

## Training Details

### Training Data

- **Synthetic Data (Train/Val):** 1020 samples (split sketch after this list)
  - 85% training (~867 samples)
  - 15% validation (~153 samples)
  - Generated using GPT-4 with 20 corruption strategies
- **Real Data (Test):** 300 samples from "The Vampyre" OCR text
- **No data leakage:** the test set contains only real OCR data, never seen during training
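
The exact splitting code is not published; the sketch below only illustrates the 85/15 train/validation split described above, assuming the synthetic pairs are stored in a two-column CSV. The file name and column layout are hypothetical.

```python
# Illustrative sketch only: an 85/15 train/validation split of the synthetic pairs.
# The file name below is an assumption, not the author's actual data layout.
from datasets import load_dataset

synthetic = load_dataset("csv", data_files="synthetic_vampyre_pairs.csv", split="train")
splits = synthetic.train_test_split(test_size=0.15, seed=42)  # 85% train / 15% validation
train_ds, val_ds = splits["train"], splits["test"]
print(len(train_ds), len(val_ds))  # roughly 867 / 153 for 1020 samples
```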

### Training Configuration

- **Epochs:** 20 (best model at epoch 2; these settings are sketched in code after this list)
- **Batch Size:** 16
- **Learning Rate:** 1e-4
- **Optimizer:** AdamW with weight decay 0.01
- **Scheduler:** linear with warmup (10% warmup steps)
- **Max Sequence Length:** 512 tokens
- **Architecture:** BART encoder-decoder with 139M parameters
- **Training Time:** ~30 minutes on GPU
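
The original training script is not included in the repository. The following is only a minimal sketch of how these hyperparameters map onto `Seq2SeqTrainingArguments`; the output directory and the choice of CER as the model-selection metric are assumptions.

```python
# Sketch only: the published hyperparameters expressed as Seq2SeqTrainingArguments.
# Not the author's actual training script; output_dir and metric name are assumptions.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="bart-vampyre-ocr",      # hypothetical path
    num_train_epochs=20,                # best checkpoint was found at epoch 2
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=1e-4,
    weight_decay=0.01,                  # AdamW weight decay (AdamW is the default optimizer)
    lr_scheduler_type="linear",
    warmup_ratio=0.1,                   # 10% warmup steps
    eval_strategy="epoch",              # "evaluation_strategy" in older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,        # keeps the best (epoch-2) checkpoint
    metric_for_best_model="cer",        # assumes compute_metrics returns a "cer" value
    greater_is_better=False,
    predict_with_generate=True,
    generation_max_length=512,
)
```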

### Corruption Strategies (Training Data)

The synthetic training data included these OCR error types (an illustrative corruption sketch follows the list):

- Character substitutions (visual similarity)
- Missing/extra characters
- Word boundary errors
- Case errors
- Punctuation errors
- Long s (ſ) substitutions
- Historical typography errors
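
The actual corruptions were produced with GPT-4 prompts, which are not reproducible here. As a purely illustrative stand-in, the sketch below applies a few of the listed error types (visually similar substitutions, long-s, dropped characters) with simple random rules:

```python
# Illustrative only: a toy corruption function mimicking a few of the listed
# OCR error types. The real training data was generated with GPT-4 prompts,
# not with this code.
import random

SUBSTITUTIONS = {"i": "1", "o": "0", "l": "1", "e": "3", "s": "5"}  # visual look-alikes

def corrupt(text: str, p_sub=0.08, p_drop=0.02, p_long_s=0.3, seed=None) -> str:
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch == "s" and rng.random() < p_long_s:
            out.append("ſ")                         # long-s substitution
        elif ch.lower() in SUBSTITUTIONS and rng.random() < p_sub:
            out.append(SUBSTITUTIONS[ch.lower()])   # visually similar character
        elif rng.random() < p_drop:
            continue                                # missing character
        else:
            out.append(ch)
    return "".join(out)

print(corrupt("The breeze whispered softly through the ancient trees", seed=0))
```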

## Training Progress

The model showed rapid improvement in early epochs:

- Epoch 1: CER 16.62%
- **Epoch 2: CER 14.49%** (best)
- Epoch 3: CER 15.86%
- Later epochs showed overfitting, with CER rising to ~20%

The best checkpoint from epoch 2 was saved and is the one available in this repository.

## Use Cases

This model is particularly effective for:

- Correcting OCR errors in historical documents
- Post-processing digitized manuscripts
- Cleaning text from scanned historical books
- Literary text restoration
- Academic research on historical texts

## Limitations

- Optimized for English historical texts
- Best performance on texts similar to 19th-century literature
- May struggle with extremely degraded or non-standard OCR
- Maximum input length: 512 tokens (see the chunking sketch after this list)
- Higher WER than the T5 baseline (37.99% vs. 22.52%)
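
Because inputs are truncated at 512 tokens, longer documents need to be corrected piecewise. A minimal sketch, assuming correction is applied paragraph by paragraph and reusing the `model` and `tokenizer` from Quick Start (a production pipeline would want smarter, overlap-aware chunking):

```python
# Sketch: work around the 512-token limit by correcting a long document in
# paragraph-sized pieces. Reuses `model` and `tokenizer` from Quick Start;
# the paragraph-based splitting is an assumption, not part of this model card.
def correct_long_text(text: str, max_length: int = 512) -> str:
    corrected = []
    for paragraph in text.split("\n\n"):
        if not paragraph.strip():
            continue
        enc = tokenizer(paragraph, return_tensors="pt", truncation=True, max_length=max_length)
        out = model.generate(**enc, max_length=max_length)
        corrected.append(tokenizer.decode(out[0], skip_special_tokens=True))
    return "\n\n".join(corrected)
```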

## Model Comparison

| Model | CER | WER | Parameters |
|-------|-----|-----|------------|
| **BART-base** (this model) | 14.49% | 37.99% | 139M |
| T5-base | **13.93%** | **22.52%** | 220M |

BART comes close to T5's character-level accuracy with fewer parameters, but it struggles noticeably more with word-level corrections.

## Evaluation Examples

| Original OCR | Corrected Output |
|--------------|------------------|
| "Th1s 1s an 0CR err0r" | "This is an OCR error" |
| "The anci3nt tre55" | "The ancient trees" |
| "bl0omiNg floweRs" | "blooming flowers" |

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{bart-vampyre-ocr,
  author       = {Ejaz},
  title        = {BART Base OCR Error Correction for Historical Texts},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ejaz111/bart-synthetic-data-vampyre-ocr-correction}}
}
```

## Author

**Ejaz** - Master's Student in AI and Robotics

## License

Apache 2.0

## Acknowledgments

- Base model: [facebook/bart-base](https://huggingface.co/facebook/bart-base)
- Training data: "The Vampyre" by John William Polidori
- Synthetic data generation: GPT-4
- Companion model: [ejaz111/t5-synthetic-data-vampyre-ocr-correction](https://huggingface.co/ejaz111/t5-synthetic-data-vampyre-ocr-correction)
|