ejaz111 committed · Commit a0e8d16 · verified · 1 Parent(s): c861c77

Upload README.md with huggingface_hub

Files changed (1): README.md (+188 lines)
---
language: en
license: apache-2.0
tags:
- text2text-generation
- ocr
- error-correction
- bart
- historical-text
datasets:
- custom
metrics:
- cer
- wer
model-index:
- name: bart-synthetic-data-vampyre-ocr-correction
  results:
  - task:
      type: text2text-generation
      name: OCR Error Correction
    dataset:
      type: custom
      name: The Vampyre (Synthetic + Real)
    metrics:
    - type: cer
      value: 14.49
      name: Character Error Rate
    - type: wer
      value: 37.99
      name: Word Error Rate
---

# BART-Base OCR Error Correction (Synthetic Data + Real Vampyre Text)

This model is a fine-tuned version of [facebook/bart-base](https://huggingface.co/facebook/bart-base) for correcting OCR errors in historical texts, specifically trained on "The Vampyre" dataset.

## 🎯 Model Description

- **Base Model:** facebook/bart-base
- **Task:** OCR error correction
- **Training Strategy:**
  - Train/Val: Synthetic OCR data (1020 samples with GPT-4 generated errors)
  - Test: Real OCR data from "The Vampyre" (300 samples)
- **Best Checkpoint:** Epoch 2
- **Validation CER:** 14.49%
- **Validation WER:** 37.99%

## 📊 Performance

Evaluated on real historical OCR text from "The Vampyre":

| Metric | Score |
|--------|-------|
| **Character Error Rate (CER)** | **14.49%** |
| **Word Error Rate (WER)** | **37.99%** |
| **Exact Match** | 0.0% |

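The CER and WER figures above can be reproduced with the 🤗 `evaluate` library; the snippet below is a minimal sketch under that assumption (it is not the original evaluation script, and the example strings are placeholders).

```python
# Minimal sketch of computing CER/WER with `evaluate` (pip install evaluate jiwer);
# not the original evaluation code used for this model card.
import evaluate

cer_metric = evaluate.load("cer")
wer_metric = evaluate.load("wer")

predictions = ["This is an OCR error"]   # model outputs (placeholders)
references = ["This is an OCR error"]    # ground-truth transcriptions (placeholders)

print("CER:", cer_metric.compute(predictions=predictions, references=references))
print("WER:", wer_metric.compute(predictions=predictions, references=references))
```
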
## 🚀 Usage

### Quick Start

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("ejaz111/bart-synthetic-data-vampyre-ocr-correction")
model = AutoModelForSeq2SeqLM.from_pretrained("ejaz111/bart-synthetic-data-vampyre-ocr-correction")

# Correct OCR errors
ocr_text = "Th1s 1s an 0CR err0r w1th m1stakes in the anc1ent text."
input_ids = tokenizer(ocr_text, return_tensors="pt", max_length=512, truncation=True).input_ids
outputs = model.generate(input_ids, max_length=512)
corrected_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Original: {ocr_text}")
print(f"Corrected: {corrected_text}")
```

### Using Pipeline

```python
from transformers import pipeline

corrector = pipeline("text2text-generation", model="ejaz111/bart-synthetic-data-vampyre-ocr-correction")
result = corrector("The breeze wh15pered so7tly through the anci3nt tre55")[0]['generated_text']
print(result)
# Output: "The breeze whispered softly through the ancient trees"
```

## 🎓 Training Details

### Training Data

- **Synthetic Data (Train/Val):** 1020 samples
  - 85% training (~867 samples)
  - 15% validation (~153 samples)
  - Generated using GPT-4 with 20 corruption strategies
- **Real Data (Test):** 300 samples from "The Vampyre" OCR text
- **No data leakage:** Test set contains only real OCR data, never seen during training

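For reference, an 85/15 split like the one above can be produced with the 🤗 `datasets` library; the snippet is only an assumed sketch (the file name is hypothetical), not the project's actual preprocessing code.

```python
# Assumed sketch of the 85/15 synthetic train/validation split;
# "synthetic_pairs.jsonl" is a hypothetical file of (ocr_text, corrected_text) pairs.
from datasets import load_dataset

synthetic = load_dataset("json", data_files="synthetic_pairs.jsonl")["train"]
split = synthetic.train_test_split(test_size=0.15, seed=42)
train_ds, val_ds = split["train"], split["test"]
print(len(train_ds), len(val_ds))  # roughly 867 / 153 for 1020 samples
```
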
### Training Configuration

- **Epochs:** 20 (best model at epoch 2)
- **Batch Size:** 16
- **Learning Rate:** 1e-4
- **Optimizer:** AdamW with weight decay 0.01
- **Scheduler:** Linear with warmup (10% warmup steps)
- **Max Sequence Length:** 512 tokens
- **Architecture:** BART encoder-decoder with 139M parameters
- **Training Time:** ~30 minutes on GPU

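These hyperparameters map naturally onto `Seq2SeqTrainingArguments` from 🤗 Transformers. The sketch below is an assumed reconstruction of the setup, not the original training script; dataset preparation, the data collator, and metric callbacks are omitted.

```python
# Assumed reconstruction of the training setup described above; not the original script.
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")  # ~139M parameters

args = Seq2SeqTrainingArguments(
    output_dir="bart-vampyre-ocr",     # hypothetical output directory
    num_train_epochs=20,
    per_device_train_batch_size=16,
    learning_rate=1e-4,
    weight_decay=0.01,                 # AdamW is the default optimizer
    warmup_ratio=0.1,                  # linear schedule with 10% warmup
    eval_strategy="epoch",             # `evaluation_strategy` on older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,       # reloads the best checkpoint (epoch 2 in this run)
    predict_with_generate=True,
)

# trainer = Seq2SeqTrainer(model=model, args=args, tokenizer=tokenizer,
#                          train_dataset=train_ds, eval_dataset=val_ds)
# trainer.train()
```
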
### Corruption Strategies (Training Data)

The synthetic training data included these OCR error types:

- Character substitutions (visual similarity)
- Missing/extra characters
- Word boundary errors
- Case errors
- Punctuation errors
- Long s (ſ) substitutions
- Historical typography errors

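For illustration only, the toy function below applies two of the error types listed above (visually similar character substitutions and the long s). The actual training pairs were generated with GPT-4, so this is not the project's data pipeline.

```python
# Toy illustration of two corruption types; the real pairs were GPT-4-generated.
import random

CONFUSIONS = {"i": "1", "o": "0", "l": "1", "e": "3", "s": "5"}  # visual look-alikes

def corrupt(text: str, p: float = 0.15, seed: int = 0) -> str:
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch == "s" and rng.random() < p:
            out.append("ſ")                        # long-s substitution
        elif ch.lower() in CONFUSIONS and rng.random() < p:
            out.append(CONFUSIONS[ch.lower()])     # visually similar substitution
        else:
            out.append(ch)
    return "".join(out)

print(corrupt("The breeze whispered softly through the ancient trees"))
```
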
## 📈 Training Progress

The model showed rapid improvement in early epochs:

- Epoch 1: CER 16.62%
- **Epoch 2: CER 14.49%** ⭐ (best)
- Epoch 3: CER 15.86%
- Later epochs showed overfitting, with CER rising to ~20%

The best checkpoint (epoch 2) is the one published in this repository.

## 💡 Use Cases

This model is particularly effective for:

- Correcting OCR errors in historical documents
- Post-processing digitized manuscripts
- Cleaning text from scanned historical books
- Literary text restoration
- Academic research on historical texts

## ⚠️ Limitations

- Optimized for English historical texts
- Best performance on texts similar to 19th-century literature
- May struggle with extremely degraded or non-standard OCR
- Maximum input length: 512 tokens
- Higher WER compared to T5 baseline (37.99% vs 22.52%)

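Because of the 512-token limit noted above, longer documents have to be split before correction. The helper below is a hypothetical sketch (not part of this repository) that packs sentences into chunks under the token budget and corrects them one at a time.

```python
# Hypothetical helper for texts longer than 512 tokens; not part of this repository.
import re
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

repo = "ejaz111/bart-synthetic-data-vampyre-ocr-correction"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSeq2SeqLM.from_pretrained(repo)

def correct_long_text(text: str, max_tokens: int = 480) -> str:
    # Greedily pack sentences into chunks that stay under the token budget.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        candidate = (current + " " + sentence).strip()
        if current and len(tokenizer(candidate).input_ids) > max_tokens:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)

    # Correct each chunk independently and rejoin.
    corrected = []
    for chunk in chunks:
        ids = tokenizer(chunk, return_tensors="pt", truncation=True, max_length=512).input_ids
        outputs = model.generate(ids, max_length=512)
        corrected.append(tokenizer.decode(outputs[0], skip_special_tokens=True))
    return " ".join(corrected)
```
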
## 🔬 Model Comparison

| Model | CER | WER | Parameters |
|-------|-----|-----|------------|
| **BART-base** (this model) | 14.49% | 37.99% | 139M |
| T5-base | **13.93%** | **22.52%** | 220M |

BART comes close to T5 at the character level but struggles more with word-level corrections; its main advantage is the smaller parameter count.

## 🔬 Evaluation Examples

| Original OCR | Corrected Output |
|-------------|------------------|
| "Th1s 1s an 0CR err0r" | "This is an OCR error" |
| "The anci3nt tre55" | "The ancient trees" |
| "bl0omiNg floweRs" | "blooming flowers" |

## 📚 Citation

If you use this model in your research, please cite:

```bibtex
@misc{bart-vampyre-ocr,
  author       = {Ejaz},
  title        = {BART Base OCR Error Correction for Historical Texts},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ejaz111/bart-synthetic-data-vampyre-ocr-correction}}
}
```

## 👀 Author

**Ejaz** - Master's Student in AI and Robotics

## 📄 License

Apache 2.0

## 🙏 Acknowledgments

- Base model: [facebook/bart-base](https://huggingface.co/facebook/bart-base)
- Training data: "The Vampyre" by John William Polidori
- Synthetic data generation: GPT-4
- Companion model: [ejaz111/t5-synthetic-data-vampyre-ocr-correction](https://huggingface.co/ejaz111/t5-synthetic-data-vampyre-ocr-correction)