---
language: en
license: apache-2.0
tags:
- text2text-generation
- ocr
- error-correction
- bart
- historical-text
datasets:
- custom
metrics:
- cer
- wer
model-index:
- name: bart-synthetic-data-vampyre-ocr-correction
results:
- task:
type: text2text-generation
name: OCR Error Correction
dataset:
type: custom
name: The Vampyre (Synthetic + Real)
metrics:
- type: cer
value: 14.49
name: Character Error Rate
- type: wer
value: 37.99
name: Word Error Rate
---
# BART-Base OCR Error Correction (Synthetic Data + Real Vampyre Text)
This model is a fine-tuned version of [facebook/bart-base](https://huggingface.co/facebook/bart-base) for correcting OCR errors in historical texts. It was trained on synthetic OCR corruptions and evaluated on real OCR output from "The Vampyre".
## 🎯 Model Description
- **Base Model:** facebook/bart-base
- **Task:** OCR error correction
- **Training Strategy:**
- Train/Val: Synthetic OCR data (1020 samples with GPT-4 generated errors)
- Test: Real OCR data from "The Vampyre" (300 samples)
- **Best Checkpoint:** Epoch 2
- **Validation CER:** 14.49%
- **Validation WER:** 37.99%
## πŸ“Š Performance
Evaluated on real historical OCR text from "The Vampyre":
| Metric | Score |
|--------|-------|
| **Character Error Rate (CER)** | **14.49%** |
| **Word Error Rate (WER)** | **37.99%** |
| **Exact Match** | 0.0% |
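The CER and WER figures above can be recomputed with the Hugging Face `evaluate` library (a minimal sketch, not the original evaluation script; it assumes `evaluate` and `jiwer` are installed):
```python
# Minimal sketch for recomputing CER/WER; the original evaluation script is not published.
# Requires: pip install evaluate jiwer
import evaluate

cer_metric = evaluate.load("cer")
wer_metric = evaluate.load("wer")

predictions = ["This is an OCR error"]   # model outputs on the real test set
references = ["This is an OCR error"]    # ground-truth transcriptions

print("CER:", cer_metric.compute(predictions=predictions, references=references))
print("WER:", wer_metric.compute(predictions=predictions, references=references))
```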
## πŸš€ Usage
### Quick Start
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("ejaz111/bart-synthetic-data-vampyre-ocr-correction")
model = AutoModelForSeq2SeqLM.from_pretrained("ejaz111/bart-synthetic-data-vampyre-ocr-correction")

# Correct OCR errors
ocr_text = "Th1s 1s an 0CR err0r w1th m1stakes in the anc1ent text."
input_ids = tokenizer(ocr_text, return_tensors="pt", max_length=512, truncation=True).input_ids
outputs = model.generate(input_ids, max_length=512)
corrected_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Original: {ocr_text}")
print(f"Corrected: {corrected_text}")
```
### Using Pipeline
```python
from transformers import pipeline
corrector = pipeline("text2text-generation", model="ejaz111/bart-synthetic-data-vampyre-ocr-correction")
result = corrector("The breeze wh15pered so7tly through the anci3nt tre55")[0]['generated_text']
print(result)
# Output: "The breeze whispered softly through the ancient trees"
```
## πŸŽ“ Training Details
### Training Data
- **Synthetic Data (Train/Val):** 1020 samples
- 85% training (~867 samples)
- 15% validation (~153 samples)
- Generated using GPT-4 with 20 corruption strategies
- **Real Data (Test):** 300 samples from "The Vampyre" OCR text
- **No data leakage:** Test set contains only real OCR data, never seen during training
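The 85/15 split above can be reproduced with `datasets` (a sketch under the assumption that the synthetic pairs are loaded into a `Dataset`; the column names are illustrative, not the actual file schema):
```python
from datasets import Dataset

# Illustrative column names; the real file layout of the synthetic data is not published.
synthetic = Dataset.from_dict({
    "ocr_text": ["Th1s 1s an 0CR err0r"],
    "clean_text": ["This is an OCR error"],
})

# 85% train / 15% validation, mirroring the split described above
split = synthetic.train_test_split(test_size=0.15, seed=42)
train_ds, val_ds = split["train"], split["test"]
```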
### Training Configuration
- **Epochs:** 20 (best model at epoch 2)
- **Batch Size:** 16
- **Learning Rate:** 1e-4
- **Optimizer:** AdamW with weight decay 0.01
- **Scheduler:** Linear with warmup (10% warmup steps)
- **Max Sequence Length:** 512 tokens
- **Architecture:** BART encoder-decoder with 139M parameters
- **Training Time:** ~30 minutes on GPU
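The hyperparameters above map roughly onto the following `transformers` setup (a sketch, not the original training script; anything not listed above is left at library defaults):
```python
from transformers import Seq2SeqTrainingArguments

# Values taken from the configuration list above; everything else stays at transformers defaults.
args = Seq2SeqTrainingArguments(
    output_dir="bart-ocr-correction",
    num_train_epochs=20,
    per_device_train_batch_size=16,
    learning_rate=1e-4,
    weight_decay=0.01,             # AdamW is the default optimizer
    lr_scheduler_type="linear",
    warmup_ratio=0.1,              # linear schedule with 10% warmup steps
    eval_strategy="epoch",         # `evaluation_strategy` on older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,   # keeps the best checkpoint (epoch 2 in this run)
    predict_with_generate=True,
)
# A Seq2SeqTrainer would then combine these arguments with the tokenized
# train/validation splits and a DataCollatorForSeq2Seq.
```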
### Corruption Strategies (Training Data)
The synthetic training data included these OCR error types:
- Character substitutions (visual similarity)
- Missing/extra characters
- Word boundary errors
- Case errors
- Punctuation errors
- Long s (ΕΏ) substitutions
- Historical typography errors
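The GPT-4 pipeline that produced the training pairs is not included here, but a rule-based stand-in for a few of these error types (visually similar substitutions, long s, dropped characters) looks roughly like this:
```python
import random

# Rule-based stand-in for a few corruption types; the actual synthetic data was generated with GPT-4.
VISUAL_SUBS = {"o": "0", "i": "1", "l": "1", "e": "3", "s": "5", "t": "7"}

def corrupt(text: str, p: float = 0.08, seed: int = 0) -> str:
    rng = random.Random(seed)
    out = []
    for ch in text:
        r = rng.random()
        if ch == "s" and r < p / 2:            # long s, common in historical typography
            out.append("ΕΏ")
        elif ch.lower() in VISUAL_SUBS and r < p:
            out.append(VISUAL_SUBS[ch.lower()])
        elif r < p / 4:                        # occasionally drop a character
            continue
        else:
            out.append(ch)
    return "".join(out)

print(corrupt("The breeze whispered softly through the ancient trees"))
```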
## πŸ“ˆ Training Progress
The model showed rapid improvement in early epochs:
- Epoch 1: CER 16.62%
- **Epoch 2: CER 14.49%** ⭐ (Best)
- Epoch 3: CER 15.86%
- Later epochs showed overfitting with CER rising to ~20%
The best checkpoint from epoch 2 was saved and is the one available in this repository.
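Overfitting after epoch 2 is the sort of pattern early stopping catches automatically; whether it was used in the original run is not documented, but a sketch with `transformers` would be:
```python
from transformers import EarlyStoppingCallback

# Assumption: not necessarily part of the original run. Stops training once the
# validation metric fails to improve for three consecutive evaluations.
callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]
# Pass `callbacks=callbacks` to the trainer; requires load_best_model_at_end=True
# and matching "epoch" evaluation/save strategies, as in the configuration sketch above.
```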
## πŸ’‘ Use Cases
This model is particularly effective for:
- Correcting OCR errors in historical documents
- Post-processing digitized manuscripts
- Cleaning text from scanned historical books
- Literary text restoration
- Academic research on historical texts
## ⚠️ Limitations
- Optimized for English historical texts
- Best performance on texts similar to 19th-century literature
- May struggle with extremely degraded or non-standard OCR
- Maximum input length: 512 tokens
- Higher WER compared to T5 baseline (37.99% vs 22.52%)
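For inputs longer than 512 tokens, one workaround is to correct the text chunk by chunk (a naive sketch; chunk boundaries are character-based here, and a careful implementation would count tokens instead):
```python
from transformers import pipeline

corrector = pipeline(
    "text2text-generation",
    model="ejaz111/bart-synthetic-data-vampyre-ocr-correction",
)

def correct_long_text(text: str, max_chars: int = 400) -> str:
    """Naively split on sentence boundaries, correct each chunk, and rejoin."""
    sentences = text.replace("\n", " ").split(". ")
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += sentence + ". "
    if current.strip():
        chunks.append(current.strip())
    corrected = [corrector(chunk, max_length=512)[0]["generated_text"] for chunk in chunks]
    return " ".join(corrected)
```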
## πŸ”¬ Model Comparison
| Model | CER | WER | Parameters |
|-------|-----|-----|------------|
| **BART-base** (this model) | 14.49% | 37.99% | 139M |
| T5-base | **13.93%** | **22.52%** | 220M |
The T5 baseline edges out BART on character-level accuracy and is markedly stronger on word-level corrections, though with considerably more parameters (220M vs. 139M).
## πŸ”¬ Evaluation Examples
| Original OCR | Corrected Output |
|-------------|------------------|
| "Th1s 1s an 0CR err0r" | "This is an OCR error" |
| "The anci3nt tre55" | "The ancient trees" |
| "bl0omiNg floweRs" | "blooming flowers" |
## πŸ“š Citation
If you use this model in your research, please cite:
```bibtex
@misc{bart-vampyre-ocr,
author = {Ejaz},
title = {BART Base OCR Error Correction for Historical Texts},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/ejaz111/bart-synthetic-data-vampyre-ocr-correction}}
}
```
## πŸ‘€ Author
**Ejaz** - Master's Student in AI and Robotics
## πŸ“„ License
Apache 2.0
## πŸ™ Acknowledgments
- Base model: [facebook/bart-base](https://huggingface.co/facebook/bart-base)
- Training data: "The Vampyre" by John William Polidori
- Synthetic data generation: GPT-4
- Companion model: [ejaz111/t5-synthetic-data-vampyre-ocr-correction](https://huggingface.co/ejaz111/t5-synthetic-data-vampyre-ocr-correction)