+---
+language: en
+license: apache-2.0
+tags:
+- text2text-generation
+- ocr
+- error-correction
+- bart
+- historical-text
+datasets:
+- custom
+metrics:
+- cer
+- wer
+model-index:
+- name: bart-synthetic-data-vampyre-ocr-correction
+  results:
+  - task:
+      type: text2text-generation
+      name: OCR Error Correction
+    dataset:
+      type: custom
+      name: The Vampyre (Synthetic + Real)
+    metrics:
+    - type: cer
+      value: 14.49
+      name: Character Error Rate
+    - type: wer
+      value: 37.99
+      name: Word Error Rate
+---
+# BART-Base OCR Error Correction (Synthetic Data + Real Vampyre Text)
+This model is a fine-tuned version of [facebook/bart-base](https://huggingface.co/facebook/bart-base) for correcting OCR errors in historical texts. It was fine-tuned on synthetic OCR errors and tested on real OCR text from "The Vampyre".
+## 🎯 Model Description
+- **Base Model:** facebook/bart-base
+- **Task:** OCR error correction
+- **Training Strategy:**
+  - Train/Val: Synthetic OCR data (1020 samples with GPT-4-generated errors)
+  - Test: Real OCR data from "The Vampyre" (300 samples)
+- **Best Checkpoint:** Epoch 2
+- **Validation CER:** 14.49%
+- **Validation WER:** 37.99%
+## 📊 Performance
+Evaluated on real historical OCR text from "The Vampyre":
+| Metric | Score |
+|--------|-------|
+| **Character Error Rate (CER)** | **14.49%** |
+| **Word Error Rate (WER)** | **37.99%** |
+| **Exact Match** | 0.0% |
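The card does not say which library produced these scores. As a minimal sketch of the standard definitions, CER and WER can be computed as edit distance divided by reference length, at character and word level respectively. The function names below are illustrative, not part of this repository.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by reference word count."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)


reference = "the ancient trees"
hypothesis = "the anci3nt tre55"
print(f"CER: {cer(reference, hypothesis):.2%}")
print(f"WER: {wer(reference, hypothesis):.2%}")
```

A single wrong character makes the whole word count as an error, which is one reason WER can sit well above CER on the same output.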
+## 🚀 Usage
+### Quick Start
+```python
+from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+# Load model and tokenizer
+tokenizer = AutoTokenizer.from_pretrained("ejaz111/bart-synthetic-data-vampyre-ocr-correction")
+model = AutoModelForSeq2SeqLM.from_pretrained("ejaz111/bart-synthetic-data-vampyre-ocr-correction")
+# Correct OCR errors
+ocr_text = "Th1s 1s an 0CR err0r w1th m1stakes in the anc1ent text."
+# Keep the full encoding so the attention mask is passed to generate()
+inputs = tokenizer(ocr_text, return_tensors="pt", max_length=512, truncation=True)
+outputs = model.generate(**inputs, max_length=512)
+corrected_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
+print(f"Original:  {ocr_text}")
+print(f"Corrected: {corrected_text}")
+```
+### Using Pipeline
+```python
+from transformers import pipeline
+corrector = pipeline("text2text-generation", model="ejaz111/bart-synthetic-data-vampyre-ocr-correction")
+result = corrector("The breeze wh15pered so7tly through the anci3nt tre55")[0]['generated_text']
+print(result)
+# Intended output (actual output may differ): "The breeze whispered softly through the ancient trees"
+```
+## 🎓 Training Details
+### Training Data
+- **Synthetic Data (Train/Val):** 1020 samples
+  - 85% training (~867 samples)
+  - 15% validation (~153 samples)
+  - Generated using GPT-4 with 20 corruption strategies
+- **Real Data (Test):** 300 samples from "The Vampyre" OCR text
+- **No data leakage:** Test set contains only real OCR data, never seen during training
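The split arithmetic above can be reproduced with a short sketch. The shuffling seed and the placeholder pairs are assumptions for illustration; the card does not describe how the split was actually drawn.

```python
import random


def split_synthetic(samples, val_fraction=0.15, seed=42):
    """Shuffle and split samples into (train, val) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seed is an assumed value
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Placeholder (noisy, clean) pairs standing in for the 1020 synthetic samples.
synthetic = [(f"noisy {i}", f"clean {i}") for i in range(1020)]
train, val = split_synthetic(synthetic)
print(len(train), len(val))
```

Because the real OCR test samples never enter `synthetic`, they cannot end up in either partition.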
+### Training Configuration
+- **Epochs:** 20 (best model at epoch 2)
+- **Batch Size:** 16
+- **Learning Rate:** 1e-4
+- **Optimizer:** AdamW with weight decay 0.01
+- **Scheduler:** Linear decay with warmup (first 10% of training steps)
+- **Max Sequence Length:** 512 tokens
+- **Architecture:** BART encoder-decoder with 139M parameters
+- **Training Time:** ~30 minutes on GPU
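To make the schedule concrete, here is a minimal sketch of a linear warmup and decay curve using the values listed above. It assumes about 867 training samples, no gradient accumulation, and that the last partial batch is kept. The card does not say which scheduler implementation was used, though the shape matches `get_linear_schedule_with_warmup` in `transformers`.

```python
import math


def linear_warmup_lr(step, total_steps, peak_lr=1e-4, warmup_fraction=0.10):
    """Learning rate at `step`: linear ramp to peak_lr, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)


# Assumed step count: ceil(867 / 16) optimizer steps per epoch, 20 epochs.
steps_per_epoch = math.ceil(867 / 16)
total_steps = steps_per_epoch * 20

for step in (0, 55, 110, 600, total_steps):
    print(step, f"{linear_warmup_lr(step, total_steps):.2e}")
```

Under these assumptions the best checkpoint (epoch 2) is saved right at the end of warmup, when the learning rate has just reached its peak.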
+### Corruption Strategies (Training Data)
+The synthetic training data included these OCR error types:
+- Character substitutions (visual similarity)
+- Missing/extra characters
+- Word boundary errors
+- Case errors
+- Punctuation errors
+- Long s (ſ) substitutions
+- Historical typography errors
+## 📈 Training Progress
+CER improved from epoch 1 to epoch 2, then worsened:
+- Epoch 1: CER 16.62%
+- **Epoch 2: CER 14.49%** ⭐ (Best)
+- Epoch 3: CER 15.86%
+- Later epochs showed overfitting with CER rising to ~20%
+The best checkpoint from epoch 2 was saved and is the one available in this repository.
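Selecting the best checkpoint amounts to tracking the lowest CER seen so far. The sketch below uses only the three per-epoch values reported above; the `select_best` helper is illustrative and not taken from the training script.

```python
def select_best(cer_by_epoch):
    """Return (epoch, cer) for the epoch with the lowest CER."""
    best_epoch, best_cer = None, float("inf")
    for epoch in sorted(cer_by_epoch):
        if cer_by_epoch[epoch] < best_cer:
            # In a training loop, this is where the checkpoint would be saved.
            best_epoch, best_cer = epoch, cer_by_epoch[epoch]
    return best_epoch, best_cer


# CER (%) per epoch as reported above; later epochs are omitted.
reported = {1: 16.62, 2: 14.49, 3: 15.86}
best_epoch, best_cer = select_best(reported)
print(f"Best epoch: {best_epoch} (CER {best_cer:.2f}%)")
```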
+## 💡 Use Cases
+This model is intended for:
+- Correcting OCR errors in historical documents
+- Post-processing digitized manuscripts
+- Cleaning text from scanned historical books
+- Literary text restoration
+- Academic research on historical texts
+## ⚠️ Limitations
+- Optimized for English historical texts
+- Best performance on texts similar to 19th-century literature
+- May struggle with extremely degraded or non-standard OCR
+- Maximum input length: 512 tokens
+- Higher WER than the T5-base companion model (37.99% vs 22.52%)
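One way to work within the 512-token limit is to split long passages at sentence boundaries and correct each chunk separately. The sketch below is an assumption about how that could be done, not something this repository provides. By default it counts whitespace-separated words as a rough proxy for tokens. With the real tokenizer loaded, `count_tokens=lambda s: len(tokenizer(s).input_ids)` would give exact counts.

```python
import re


def chunk_text(text, max_tokens=512, count_tokens=lambda s: len(s.split())):
    """Greedily pack sentences into chunks of at most max_tokens each."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(current)  # current chunk is full; start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


page = "One two. Three four. Five six."
print(chunk_text(page, max_tokens=4))
```

A single sentence longer than `max_tokens` still ends up as its own oversized chunk and would be truncated by the tokenizer, so heavily degraded text without sentence punctuation needs a different splitting rule.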
+## 🔬 Model Comparison
+| Model | CER | WER | Parameters |
+|-------|-----|-----|------------|
+| **BART-base** (this model) | 14.49% | 37.99% | 139M |
+| T5-base | **13.93%** | **22.52%** | 220M |
+T5-base achieves a slightly lower CER (13.93% vs 14.49%) and a much lower WER (22.52% vs 37.99%), so BART-base trails on both metrics, most clearly at the word level, while using fewer parameters.
+## 🔬 Evaluation Examples
+| Original OCR | Corrected Output |
+|-------------|------------------|
+| "Th1s 1s an 0CR err0r" | "This is an OCR error" |
+| "The anci3nt tre55" | "The ancient trees" |
+| "bl0omiNg floweRs" | "blooming flowers" |
+## 📚 Citation
+If you use this model in your research, please cite:
+```bibtex
+@misc{bart-vampyre-ocr,
+  author = {Ejaz},
+  title = {BART Base OCR Error Correction for Historical Texts},
+  year = {2025},
+  publisher = {Hugging Face},
+  howpublished = {\url{https://huggingface.co/ejaz111/bart-synthetic-data-vampyre-ocr-correction}}
+}
+```
+## 👤 Author
+**Ejaz** - Master's Student in AI and Robotics
+## 📄 License
+Apache 2.0
+## 🙏 Acknowledgments
+- Base model: [facebook/bart-base](https://huggingface.co/facebook/bart-base)
+- Training data: "The Vampyre" by John William Polidori
+- Synthetic data generation: GPT-4
+- Companion model: [ejaz111/t5-synthetic-data-vampyre-ocr-correction](https://huggingface.co/ejaz111/t5-synthetic-data-vampyre-ocr-correction)