# BART-Base OCR Error Correction (Synthetic Data + Real Vampyre Text)

This model is a fine-tuned version of facebook/bart-base for correcting OCR errors in historical texts. It was trained on synthetic OCR corruptions and evaluated on real OCR output from "The Vampyre".
## Model Description
- Base Model: facebook/bart-base
- Task: OCR error correction
- Training Strategy:
  - Train/Val: synthetic OCR data (1020 samples with GPT-4-generated errors)
  - Test: real OCR data from "The Vampyre" (300 samples)
- Best Checkpoint: epoch 2
  - Validation CER: 14.49%
  - Validation WER: 37.99%
## Performance
Evaluated on real historical OCR text from "The Vampyre":
| Metric | Score |
|---|---|
| Character Error Rate (CER) | 14.49% |
| Word Error Rate (WER) | 37.99% |
| Exact Match | 0.0% |
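For reference, CER and WER of the kind reported above can be computed with the `jiwer` package. This is a minimal sketch, not part of this repository; the example strings are purely illustrative and assume you have paired ground-truth and model-corrected texts.

```python
# Minimal CER/WER sketch (assumes `pip install jiwer`).
# `references` and `predictions` are hypothetical paired lists of
# ground-truth text and model-corrected text.
import jiwer

references = ["The breeze whispered softly through the ancient trees."]
predictions = ["The breeze whispered softly through the ancient tres."]

cer = jiwer.cer(references, predictions)  # character error rate
wer = jiwer.wer(references, predictions)  # word error rate
print(f"CER: {cer:.2%}  WER: {wer:.2%}")
```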
## Usage

### Quick Start
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("ejaz111/bart-synthetic-data-vampyre-ocr-correction")
model = AutoModelForSeq2SeqLM.from_pretrained("ejaz111/bart-synthetic-data-vampyre-ocr-correction")

# Correct OCR errors
ocr_text = "Th1s 1s an 0CR err0r w1th m1stakes in the anc1ent text."
input_ids = tokenizer(ocr_text, return_tensors="pt", max_length=512, truncation=True).input_ids
outputs = model.generate(input_ids, max_length=512)
corrected_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Original: {ocr_text}")
print(f"Corrected: {corrected_text}")
```
### Using Pipeline
```python
from transformers import pipeline

corrector = pipeline("text2text-generation", model="ejaz111/bart-synthetic-data-vampyre-ocr-correction")

result = corrector("The breeze wh15pered so7tly through the anci3nt tre55")[0]['generated_text']
print(result)
# Output: "The breeze whispered softly through the ancient trees"
```
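The pipeline also accepts a list of strings, so multiple OCR lines can be corrected in one batched call. A small sketch reusing the `corrector` above; the `batch_size` value and the example lines are arbitrary illustrative choices.

```python
# Hypothetical list of OCR lines to correct in one batched call.
ocr_lines = [
    "Th1s 1s an 0CR err0r",
    "The anci3nt tre55 swayed in the w1nd",
]

# One generation per input, so each result is a dict with "generated_text".
results = corrector(ocr_lines, batch_size=8, max_length=512)
for line, out in zip(ocr_lines, results):
    print(line, "->", out["generated_text"])
```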
## Training Details

### Training Data
- Synthetic Data (Train/Val): 1020 samples (see the split sketch after this list)
  - 85% training (~867 samples)
  - 15% validation (~153 samples)
  - Generated using GPT-4 with 20 corruption strategies
- Real Data (Test): 300 samples from "The Vampyre" OCR text
- No data leakage: the test set contains only real OCR data, never seen during training
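The 85/15 split could be reproduced with the `datasets` library. A minimal sketch only: the file name `synthetic_pairs.json` and the `ocr`/`clean` field names are assumptions, not the actual preprocessing script.

```python
from datasets import load_dataset

# Hypothetical file and field names for the synthetic OCR/clean text pairs.
dataset = load_dataset("json", data_files="synthetic_pairs.json", split="train")

# 85% train / 15% validation, matching the split described above.
splits = dataset.train_test_split(test_size=0.15, seed=42)
train_ds, val_ds = splits["train"], splits["test"]
print(len(train_ds), len(val_ds))  # ~867 / ~153 for 1020 samples
```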
### Training Configuration
- Epochs: 20 (best model at epoch 2)
- Batch Size: 16
- Learning Rate: 1e-4
- Optimizer: AdamW with weight decay 0.01
- Scheduler: Linear with warmup (10% warmup steps)
- Max Sequence Length: 512 tokens
- Architecture: BART encoder-decoder with 139M parameters
- Training Time: ~30 minutes on GPU
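The configuration above roughly corresponds to the following `Seq2SeqTrainingArguments` setup. This is an illustrative sketch rather than the exact training script: the `output_dir`, the `ocr`/`clean` field names, and the reuse of `train_ds`/`val_ds` from the Training Data sketch are all assumptions.

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def preprocess(batch):
    # Hypothetical field names "ocr" (corrupted input) and "clean" (target text).
    model_inputs = tokenizer(batch["ocr"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["clean"], max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# train_ds / val_ds are the splits from the Training Data sketch above.
tokenized_train = train_ds.map(preprocess, batched=True, remove_columns=train_ds.column_names)
tokenized_val = val_ds.map(preprocess, batched=True, remove_columns=val_ds.column_names)

# Placeholder output directory; hyperparameters mirror the list above
# (AdamW with weight decay is the Trainer's default optimizer).
args = Seq2SeqTrainingArguments(
    output_dir="bart-ocr-correction",
    num_train_epochs=20,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=1e-4,
    weight_decay=0.01,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,              # 10% warmup steps
    eval_strategy="epoch",         # `evaluation_strategy` in older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,   # keep the best checkpoint (epoch 2 in this run)
    predict_with_generate=True,
    generation_max_length=512,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```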
### Corruption Strategies (Training Data)
The synthetic training data included these OCR error types:
- Character substitutions (visual similarity)
- Missing/extra characters
- Word boundary errors
- Case errors
- Punctuation errors
- Long s (ſ) substitutions
- Historical typography errors
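As an illustration only (the actual corruptions were generated by prompting GPT-4, not by code like this), a toy corrupter for two of these strategies, visually similar character substitutions and long-s insertion, might look like:

```python
import random

# Toy example of two corruption strategies from the list above.
# The real training data was produced with GPT-4 prompts, not this code.
VISUAL_SUBS = {"o": "0", "i": "1", "l": "1", "e": "3", "s": "5", "b": "6"}

def corrupt(text: str, sub_prob: float = 0.1, long_s_prob: float = 0.3, seed: int = 0) -> str:
    rng = random.Random(seed)
    chars = []
    for ch in text:
        if ch == "s" and rng.random() < long_s_prob:
            chars.append("ſ")                      # long s substitution
        elif ch.lower() in VISUAL_SUBS and rng.random() < sub_prob:
            chars.append(VISUAL_SUBS[ch.lower()])  # visual look-alike
        else:
            chars.append(ch)
    return "".join(chars)

print(corrupt("The breeze whispered softly through the ancient trees"))
```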
## Training Progress
The model showed rapid improvement in early epochs:
- Epoch 1: CER 16.62%
- Epoch 2: CER 14.49% (best)
- Epoch 3: CER 15.86%
- Later epochs showed overfitting with CER rising to ~20%
The best checkpoint from epoch 2 was saved and is the one available in this repository.
## Use Cases
This model is particularly effective for:
- Correcting OCR errors in historical documents
- Post-processing digitized manuscripts
- Cleaning text from scanned historical books
- Literary text restoration
- Academic research on historical texts
## Limitations
- Optimized for English historical texts
- Best performance on texts similar to 19th-century literature
- May struggle with extremely degraded or non-standard OCR
- Maximum input length: 512 tokens (see the chunking sketch after this list)
- Higher WER compared to T5 baseline (37.99% vs 22.52%)
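Because of the 512-token limit, longer passages should be corrected chunk by chunk. A minimal word-based chunking sketch, reusing the `tokenizer` and `model` from the Quick Start; a real pipeline would more likely split on sentence boundaries, and the `stride_words` value is an arbitrary choice.

```python
def correct_long_text(text: str, tokenizer, model, max_tokens: int = 512, stride_words: int = 80) -> str:
    """Correct text longer than the 512-token limit by splitting it into
    word-based chunks. Sentence-boundary splitting would likely work better;
    this is only a sketch."""
    words = text.split()
    corrected_chunks = []
    for start in range(0, len(words), stride_words):
        chunk = " ".join(words[start:start + stride_words])
        inputs = tokenizer(chunk, return_tensors="pt", max_length=max_tokens, truncation=True)
        outputs = model.generate(**inputs, max_length=max_tokens)
        corrected_chunks.append(tokenizer.decode(outputs[0], skip_special_tokens=True))
    return " ".join(corrected_chunks)
```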
## Model Comparison
| Model | CER | WER | Parameters |
|---|---|---|---|
| BART-base | 14.49% | 37.99% | 139M |
| T5-base | 13.93% | 22.52% | 220M |
The two models are close at the character level, but BART struggles noticeably more with word-level corrections despite having fewer parameters.
## Evaluation Examples
| Original OCR | Corrected Output |
|---|---|
| "Th1s 1s an 0CR err0r" | "This is an OCR error" |
| "The anci3nt tre55" | "The ancient trees" |
| "bl0omiNg floweRs" | "blooming flowers" |
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{bart-vampyre-ocr,
  author       = {Ejaz},
  title        = {BART Base OCR Error Correction for Historical Texts},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ejaz111/bart-synthetic-data-vampyre-ocr-correction}}
}
```
## Author
Ejaz - Master's Student in AI and Robotics
## License
Apache 2.0
## Acknowledgments
- Base model: facebook/bart-base
- Training data: "The Vampyre" by John William Polidori
- Synthetic data generation: GPT-4
- Companion model: ejaz111/t5-synthetic-data-vampyre-ocr-correction