# BART-Base OCR Error Correction (Synthetic Data + Real Vampyre Text)

This model is a fine-tuned version of facebook/bart-base for correcting OCR errors in historical texts. It was trained on synthetic OCR corruptions and evaluated on real OCR output from "The Vampyre".
## Model Description
- Base Model: facebook/bart-base
- Task: OCR error correction
- Training Strategy:
  - Train/Val: synthetic OCR data (1020 samples with GPT-4-generated errors)
  - Test: real OCR data from "The Vampyre" (300 samples)
- Best Checkpoint: epoch 2
  - Validation CER: 14.49%
  - Validation WER: 37.99%
## Performance
Evaluated on real historical OCR text from "The Vampyre":
| Metric | Score |
|---|---|
| Character Error Rate (CER) | 14.49% |
| Word Error Rate (WER) | 37.99% |
| Exact Match | 0.0% |
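For reference, CER and WER of the kind reported above can be computed with the `jiwer` package. This is a minimal sketch, not part of this repository; the example strings are purely illustrative and assume you have paired ground-truth and model-corrected texts.

```python
# Minimal CER/WER sketch (assumes `pip install jiwer`).
# `references` and `predictions` are hypothetical paired lists of
# ground-truth text and model-corrected text.
import jiwer

references = ["The breeze whispered softly through the ancient trees."]
predictions = ["The breeze whispered softly through the ancient tres."]

cer = jiwer.cer(references, predictions)  # character error rate
wer = jiwer.wer(references, predictions)  # word error rate
print(f"CER: {cer:.2%}  WER: {wer:.2%}")
```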
## Usage

### Quick Start
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("ejaz111/bart-synthetic-data-vampyre-ocr-correction")
model = AutoModelForSeq2SeqLM.from_pretrained("ejaz111/bart-synthetic-data-vampyre-ocr-correction")

# Correct OCR errors
ocr_text = "Th1s 1s an 0CR err0r w1th m1stakes in the anc1ent text."
input_ids = tokenizer(ocr_text, return_tensors="pt", max_length=512, truncation=True).input_ids
outputs = model.generate(input_ids, max_length=512)
corrected_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Original: {ocr_text}")
print(f"Corrected: {corrected_text}")
```
### Using Pipeline
```python
from transformers import pipeline

corrector = pipeline("text2text-generation", model="ejaz111/bart-synthetic-data-vampyre-ocr-correction")

result = corrector("The breeze wh15pered so7tly through the anci3nt tre55")[0]['generated_text']
print(result)
# Output: "The breeze whispered softly through the ancient trees"
```
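The pipeline also accepts a list of strings, so multiple OCR lines can be corrected in one batched call. A small sketch reusing the `corrector` above; the `batch_size` value and the example lines are arbitrary illustrative choices.

```python
# Hypothetical list of OCR lines to correct in one batched call.
ocr_lines = [
    "Th1s 1s an 0CR err0r",
    "The anci3nt tre55 swayed in the w1nd",
]

# One generation per input, so each result is a dict with "generated_text".
results = corrector(ocr_lines, batch_size=8, max_length=512)
for line, out in zip(ocr_lines, results):
    print(line, "->", out["generated_text"])
```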
## Training Details

### Training Data
- Synthetic Data (Train/Val): 1020 samples (see the split sketch after this list)
  - 85% training (~867 samples)
  - 15% validation (~153 samples)
  - Generated using GPT-4 with 20 corruption strategies
- Real Data (Test): 300 samples from "The Vampyre" OCR text
- No data leakage: the test set contains only real OCR data, never seen during training
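The 85/15 split could be reproduced with the `datasets` library. A minimal sketch only: the file name `synthetic_pairs.json` and the `ocr`/`clean` field names are assumptions, not the actual preprocessing script.

```python
from datasets import load_dataset

# Hypothetical file and field names for the synthetic OCR/clean text pairs.
dataset = load_dataset("json", data_files="synthetic_pairs.json", split="train")

# 85% train / 15% validation, matching the split described above.
splits = dataset.train_test_split(test_size=0.15, seed=42)
train_ds, val_ds = splits["train"], splits["test"]
print(len(train_ds), len(val_ds))  # ~867 / ~153 for 1020 samples
```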
### Training Configuration
- Epochs: 20 (best model at epoch 2)
- Batch Size: 16
- Learning Rate: 1e-4
- Optimizer: AdamW with weight decay 0.01
- Scheduler: Linear with warmup (10% warmup steps)
- Max Sequence Length: 512 tokens
- Architecture: BART encoder-decoder with 139M parameters
- Training Time: ~30 minutes on GPU
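The configuration above roughly corresponds to the following `Seq2SeqTrainingArguments` setup. This is an illustrative sketch rather than the exact training script: the `output_dir`, the `ocr`/`clean` field names, and the reuse of `train_ds`/`val_ds` from the Training Data sketch are all assumptions.

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def preprocess(batch):
    # Hypothetical field names "ocr" (corrupted input) and "clean" (target text).
    model_inputs = tokenizer(batch["ocr"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["clean"], max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# train_ds / val_ds are the splits from the Training Data sketch above.
tokenized_train = train_ds.map(preprocess, batched=True, remove_columns=train_ds.column_names)
tokenized_val = val_ds.map(preprocess, batched=True, remove_columns=val_ds.column_names)

# Placeholder output directory; hyperparameters mirror the list above
# (AdamW with weight decay is the Trainer's default optimizer).
args = Seq2SeqTrainingArguments(
    output_dir="bart-ocr-correction",
    num_train_epochs=20,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=1e-4,
    weight_decay=0.01,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,              # 10% warmup steps
    eval_strategy="epoch",         # `evaluation_strategy` in older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,   # keep the best checkpoint (epoch 2 in this run)
    predict_with_generate=True,
    generation_max_length=512,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```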
### Corruption Strategies (Training Data)
The synthetic training data included these OCR error types:
- Character substitutions (visual similarity)
- Missing/extra characters
- Word boundary errors
- Case errors
- Punctuation errors
- Long s (ſ) substitutions
- Historical typography errors
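As an illustration only (the actual corruptions were generated by prompting GPT-4, not by code like this), a toy corrupter for two of these strategies, visually similar character substitutions and long-s insertion, might look like:

```python
import random

# Toy example of two corruption strategies from the list above.
# The real training data was produced with GPT-4 prompts, not this code.
VISUAL_SUBS = {"o": "0", "i": "1", "l": "1", "e": "3", "s": "5", "b": "6"}

def corrupt(text: str, sub_prob: float = 0.1, long_s_prob: float = 0.3, seed: int = 0) -> str:
    rng = random.Random(seed)
    chars = []
    for ch in text:
        if ch == "s" and rng.random() < long_s_prob:
            chars.append("ſ")                      # long s substitution
        elif ch.lower() in VISUAL_SUBS and rng.random() < sub_prob:
            chars.append(VISUAL_SUBS[ch.lower()])  # visual look-alike
        else:
            chars.append(ch)
    return "".join(chars)

print(corrupt("The breeze whispered softly through the ancient trees"))
```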
## Training Progress
The model showed rapid improvement in early epochs:
- Epoch 1: CER 16.62%
- Epoch 2: CER 14.49% (best)
- Epoch 3: CER 15.86%
- Later epochs showed overfitting with CER rising to ~20%
The best checkpoint from epoch 2 was saved and is the one available in this repository.
## Use Cases
This model is particularly effective for:
- Correcting OCR errors in historical documents
- Post-processing digitized manuscripts
- Cleaning text from scanned historical books
- Literary text restoration
- Academic research on historical texts
## Limitations
- Optimized for English historical texts
- Best performance on texts similar to 19th-century literature
- May struggle with extremely degraded or non-standard OCR
- Maximum input length: 512 tokens (see the chunking sketch after this list)
- Higher WER compared to T5 baseline (37.99% vs 22.52%)
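Because of the 512-token limit, longer passages should be corrected chunk by chunk. A minimal word-based chunking sketch, reusing the `tokenizer` and `model` from the Quick Start; a real pipeline would more likely split on sentence boundaries, and the `stride_words` value is an arbitrary choice.

```python
def correct_long_text(text: str, tokenizer, model, max_tokens: int = 512, stride_words: int = 80) -> str:
    """Correct text longer than the 512-token limit by splitting it into
    word-based chunks. Sentence-boundary splitting would likely work better;
    this is only a sketch."""
    words = text.split()
    corrected_chunks = []
    for start in range(0, len(words), stride_words):
        chunk = " ".join(words[start:start + stride_words])
        inputs = tokenizer(chunk, return_tensors="pt", max_length=max_tokens, truncation=True)
        outputs = model.generate(**inputs, max_length=max_tokens)
        corrected_chunks.append(tokenizer.decode(outputs[0], skip_special_tokens=True))
    return " ".join(corrected_chunks)
```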
## Model Comparison
| Model | CER | WER | Parameters |
|---|---|---|---|
| BART-base | 14.49% | 37.99% | 139M |
| T5-base | 13.93% | 22.52% | 220M |
The two models are close at the character level, but BART struggles noticeably more with word-level corrections despite having fewer parameters.
## Evaluation Examples
| Original OCR | Corrected Output |
|---|---|
| "Th1s 1s an 0CR err0r" | "This is an OCR error" |
| "The anci3nt tre55" | "The ancient trees" |
| "bl0omiNg floweRs" | "blooming flowers" |
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{bart-vampyre-ocr,
  author       = {Ejaz},
  title        = {BART Base OCR Error Correction for Historical Texts},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ejaz111/bart-synthetic-data-vampyre-ocr-correction}}
}
```
## Author
Ejaz - Master's Student in AI and Robotics
## License
Apache 2.0
## Acknowledgments
- Base model: facebook/bart-base
- Training data: "The Vampyre" by John William Polidori
- Synthetic data generation: GPT-4
- Companion model: ejaz111/t5-synthetic-data-vampyre-ocr-correction