---
language: en
license: apache-2.0
tags:
- text2text-generation
- ocr
- error-correction
- bart
- historical-text
datasets:
- custom
metrics:
- cer
- wer
model-index:
- name: bart-synthetic-data-vampyre-ocr-correction
  results:
  - task:
      type: text2text-generation
      name: OCR Error Correction
    dataset:
      type: custom
      name: The Vampyre (Synthetic + Real)
    metrics:
    - type: cer
      value: 14.49
      name: Character Error Rate
    - type: wer
      value: 37.99
      name: Word Error Rate
---

# BART-Base OCR Error Correction (Synthetic Data + Real Vampyre Text)

This model is a fine-tuned version of [facebook/bart-base](https://huggingface.co/facebook/bart-base) for correcting OCR errors in historical texts; it was trained specifically on "The Vampyre" dataset.

## Model Description

- **Base Model:** facebook/bart-base
- **Task:** OCR error correction
- **Training Strategy:**
  - Train/Val: synthetic OCR data (1020 samples with GPT-4 generated errors)
  - Test: real OCR data from "The Vampyre" (300 samples)
- **Best Checkpoint:** epoch 2
- **Validation CER:** 14.49%
- **Validation WER:** 37.99%

## Performance

Evaluated on real historical OCR text from "The Vampyre":

| Metric | Score |
|--------|-------|
| **Character Error Rate (CER)** | **14.49%** |
| **Word Error Rate (WER)** | **37.99%** |
| **Exact Match** | 0.0% |
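
CER and WER are standard edit-distance metrics. The evaluation script used for this card is not included in the repository, so the snippet below is only a minimal sketch of how such scores are typically computed with the Hugging Face `evaluate` library (the `cer` metric additionally requires `jiwer`):

```python
# Minimal sketch (not the author's evaluation script): computing CER/WER
# for prediction/reference pairs with the `evaluate` library.
import evaluate

cer_metric = evaluate.load("cer")  # character error rate (needs `jiwer` installed)
wer_metric = evaluate.load("wer")  # word error rate

predictions = ["This is an OCR error", "The ancient trees"]  # model outputs
references = ["This is an OCR error", "The ancient trees"]   # ground-truth text

print("CER:", cer_metric.compute(predictions=predictions, references=references))
print("WER:", wer_metric.compute(predictions=predictions, references=references))
```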

## Usage

### Quick Start

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("ejaz111/bart-synthetic-data-vampyre-ocr-correction")
model = AutoModelForSeq2SeqLM.from_pretrained("ejaz111/bart-synthetic-data-vampyre-ocr-correction")

# Correct OCR errors
ocr_text = "Th1s 1s an 0CR err0r w1th m1stakes in the anc1ent text."
input_ids = tokenizer(ocr_text, return_tensors="pt", max_length=512, truncation=True).input_ids
outputs = model.generate(input_ids, max_length=512)
corrected_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Original: {ocr_text}")
print(f"Corrected: {corrected_text}")
```
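
For correcting many OCR lines at once (e.g., a whole scanned page split into lines), batching the tokenizer call is usually faster. A small sketch, reusing the `model` and `tokenizer` loaded in Quick Start (the example lines are illustrative):

```python
# Sketch: batch-correct several OCR lines at once (reuses model/tokenizer from above).
lines = [
    "Th1s 1s an 0CR err0r",
    "The breeze wh15pered so7tly through the anci3nt tre55",
]
batch = tokenizer(lines, return_tensors="pt", padding=True, truncation=True, max_length=512)
outputs = model.generate(**batch, max_length=512)
for line, out in zip(lines, outputs):
    print(line, "->", tokenizer.decode(out, skip_special_tokens=True))
```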

### Using Pipeline

```python
from transformers import pipeline

corrector = pipeline("text2text-generation", model="ejaz111/bart-synthetic-data-vampyre-ocr-correction")
result = corrector("The breeze wh15pered so7tly through the anci3nt tre55")[0]['generated_text']
print(result)
# Output: "The breeze whispered softly through the ancient trees"
```

## Training Details

### Training Data

- **Synthetic Data (Train/Val):** 1020 samples (split sketch after this list)
  - 85% training (~867 samples)
  - 15% validation (~153 samples)
  - Generated using GPT-4 with 20 corruption strategies
- **Real Data (Test):** 300 samples from "The Vampyre" OCR text
- **No data leakage:** the test set contains only real OCR data, never seen during training
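
The exact splitting code is not published; the sketch below only illustrates the 85/15 train/validation split described above, assuming the synthetic pairs are stored in a two-column CSV. The file name and column layout are hypothetical.

```python
# Illustrative sketch only: an 85/15 train/validation split of the synthetic pairs.
# The file name below is an assumption, not the author's actual data layout.
from datasets import load_dataset

synthetic = load_dataset("csv", data_files="synthetic_vampyre_pairs.csv", split="train")
splits = synthetic.train_test_split(test_size=0.15, seed=42)  # 85% train / 15% validation
train_ds, val_ds = splits["train"], splits["test"]
print(len(train_ds), len(val_ds))  # roughly 867 / 153 for 1020 samples
```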

### Training Configuration

- **Epochs:** 20 (best model at epoch 2; these settings are sketched in code after this list)
- **Batch Size:** 16
- **Learning Rate:** 1e-4
- **Optimizer:** AdamW with weight decay 0.01
- **Scheduler:** linear with warmup (10% warmup steps)
- **Max Sequence Length:** 512 tokens
- **Architecture:** BART encoder-decoder with 139M parameters
- **Training Time:** ~30 minutes on GPU
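
The original training script is not included in the repository. The following is only a minimal sketch of how these hyperparameters map onto `Seq2SeqTrainingArguments`; the output directory and the choice of CER as the model-selection metric are assumptions.

```python
# Sketch only: the published hyperparameters expressed as Seq2SeqTrainingArguments.
# Not the author's actual training script; output_dir and metric name are assumptions.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="bart-vampyre-ocr",      # hypothetical path
    num_train_epochs=20,                # best checkpoint was found at epoch 2
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=1e-4,
    weight_decay=0.01,                  # AdamW weight decay (AdamW is the default optimizer)
    lr_scheduler_type="linear",
    warmup_ratio=0.1,                   # 10% warmup steps
    eval_strategy="epoch",              # "evaluation_strategy" in older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,        # keeps the best (epoch-2) checkpoint
    metric_for_best_model="cer",        # assumes compute_metrics returns a "cer" value
    greater_is_better=False,
    predict_with_generate=True,
    generation_max_length=512,
)
```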

### Corruption Strategies (Training Data)

The synthetic training data included these OCR error types (an illustrative corruption sketch follows the list):

- Character substitutions (visual similarity)
- Missing/extra characters
- Word boundary errors
- Case errors
- Punctuation errors
- Long s (ſ) substitutions
- Historical typography errors
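
The actual corruptions were produced with GPT-4 prompts, which are not reproducible here. As a purely illustrative stand-in, the sketch below applies a few of the listed error types (visually similar substitutions, long-s, dropped characters) with simple random rules:

```python
# Illustrative only: a toy corruption function mimicking a few of the listed
# OCR error types. The real training data was generated with GPT-4 prompts,
# not with this code.
import random

SUBSTITUTIONS = {"i": "1", "o": "0", "l": "1", "e": "3", "s": "5"}  # visual look-alikes

def corrupt(text: str, p_sub=0.08, p_drop=0.02, p_long_s=0.3, seed=None) -> str:
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch == "s" and rng.random() < p_long_s:
            out.append("ſ")                         # long-s substitution
        elif ch.lower() in SUBSTITUTIONS and rng.random() < p_sub:
            out.append(SUBSTITUTIONS[ch.lower()])   # visually similar character
        elif rng.random() < p_drop:
            continue                                # missing character
        else:
            out.append(ch)
    return "".join(out)

print(corrupt("The breeze whispered softly through the ancient trees", seed=0))
```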

## Training Progress

The model showed rapid improvement in early epochs:

- Epoch 1: CER 16.62%
- **Epoch 2: CER 14.49%** (best)
- Epoch 3: CER 15.86%
- Later epochs showed overfitting, with CER rising to ~20%

The best checkpoint from epoch 2 was saved and is the one available in this repository.

## Use Cases

This model is particularly effective for:

- Correcting OCR errors in historical documents
- Post-processing digitized manuscripts
- Cleaning text from scanned historical books
- Literary text restoration
- Academic research on historical texts

## Limitations

- Optimized for English historical texts
- Best performance on texts similar to 19th-century literature
- May struggle with extremely degraded or non-standard OCR
- Maximum input length: 512 tokens (see the chunking sketch after this list)
- Higher WER than the T5 baseline (37.99% vs. 22.52%)
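
Because inputs are truncated at 512 tokens, longer documents need to be corrected piecewise. A minimal sketch, assuming correction is applied paragraph by paragraph and reusing the `model` and `tokenizer` from Quick Start (a production pipeline would want smarter, overlap-aware chunking):

```python
# Sketch: work around the 512-token limit by correcting a long document in
# paragraph-sized pieces. Reuses `model` and `tokenizer` from Quick Start;
# the paragraph-based splitting is an assumption, not part of this model card.
def correct_long_text(text: str, max_length: int = 512) -> str:
    corrected = []
    for paragraph in text.split("\n\n"):
        if not paragraph.strip():
            continue
        enc = tokenizer(paragraph, return_tensors="pt", truncation=True, max_length=max_length)
        out = model.generate(**enc, max_length=max_length)
        corrected.append(tokenizer.decode(out[0], skip_special_tokens=True))
    return "\n\n".join(corrected)
```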

## Model Comparison

| Model | CER | WER | Parameters |
|-------|-----|-----|------------|
| **BART-base** (this model) | 14.49% | 37.99% | 139M |
| T5-base | **13.93%** | **22.52%** | 220M |

BART comes close to T5's character-level accuracy with fewer parameters, but it struggles noticeably more with word-level corrections.

## Evaluation Examples

| Original OCR | Corrected Output |
|--------------|------------------|
| "Th1s 1s an 0CR err0r" | "This is an OCR error" |
| "The anci3nt tre55" | "The ancient trees" |
| "bl0omiNg floweRs" | "blooming flowers" |

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{bart-vampyre-ocr,
  author       = {Ejaz},
  title        = {BART Base OCR Error Correction for Historical Texts},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ejaz111/bart-synthetic-data-vampyre-ocr-correction}}
}
```

## Author

**Ejaz** - Master's Student in AI and Robotics

## License

Apache 2.0

## Acknowledgments

- Base model: [facebook/bart-base](https://huggingface.co/facebook/bart-base)
- Training data: "The Vampyre" by John William Polidori
- Synthetic data generation: GPT-4
- Companion model: [ejaz111/t5-synthetic-data-vampyre-ocr-correction](https://huggingface.co/ejaz111/t5-synthetic-data-vampyre-ocr-correction)
|