---
language:
- en
- gd
metrics:
- chrf
- comet
tags:
- synthetic
- machine-translation
- low-resource
- data-augmentation
- nmt
- multilingual
- dataset
- transformer-base
license: cc-by-4.0
datasets:
- openlanguagedata/flores_plus
---

# Helsinki-NLP/opus-mt-synthetic-en-gd

## Model Overview

This model is the synthetic baseline (transformer-base) for the English-Scottish Gaelic language pair from our paper ["Scaling Low-Resource MT via Synthetic Data Generation with LLMs"](https://arxiv.org/abs/2505.14423).
The training data was generated by forward-translating English Europarl with GPT-4o, and it is specifically aimed at improving MT performance for underrepresented languages by supplementing traditional datasets with high-quality, LLM-generated translations.

The goal of this model is to provide a baseline for MT tasks, demonstrating the potential of synthetic data to enhance translation capabilities for languages with limited existing resources.

For the detailed methodology, see the full paper [here](https://arxiv.org/abs/2505.14423).
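
The data-generation pipeline itself is not shipped with this card. As a rough, hypothetical sketch of the forward-translation step described above, the snippet below shows how an English source sentence could be forward-translated with GPT-4o via the OpenAI Python client; the prompt wording and settings are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch of forward translation with GPT-4o (not the paper's exact setup).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def forward_translate(sentence: str, target_language: str = "Scottish Gaelic") -> str:
    """Ask GPT-4o to translate one English sentence into the target language."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": f"Translate the user's sentence into {target_language}. "
                        "Return only the translation."},
            {"role": "user", "content": sentence},
        ],
    )
    return response.choices[0].message.content.strip()

# Example: forward-translate one Europarl-style source sentence
print(forward_translate("The committee adopted the resolution unanimously."))
```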

## Supported Language Pair

* **English ↔ Scottish Gaelic**

## Evaluation

The quality of the generated synthetic data was evaluated with both automatic metrics (ChrF and COMET) and human evaluation. The evaluation shows that the synthetic data generally performs well for low-resource languages, with significant gains observed when the data is used in downstream MT training.
Below are the evaluation results on FLORES+:

| Language Pair             | ChrF Score | COMET Score |
| ------------------------- | ---------- | ----------- |
| English ↔ Basque          | 53.00      | 81.51       |
| English ↔ Scottish Gaelic | 51.10      | 78.04       |
| English ↔ Icelandic       | 49.91      | 80.16       |
| English ↔ Georgian        | 49.49      | 80.72       |
| English ↔ Macedonian      | 57.72      | 82.24       |
| English ↔ Somali          | 45.10      | 78.15       |
| English ↔ Ukrainian       | 51.71      | 78.89       |

The results demonstrate that synthetic data provides strong baseline performance across all language pairs, with the best results for Macedonian and Ukrainian, which are less low-resource than the other languages in this set.
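
The evaluation scripts are not part of this card. As a minimal sketch of how such scores can be computed, the snippet below scores translations with ChrF using sacrebleu; the example sentences are placeholders, not FLORES+ data, and COMET scores can be computed analogously with the `unbabel-comet` package.

```python
# Minimal sketch: scoring translations with ChrF via sacrebleu.
# The hypotheses/references below are placeholders, not FLORES+ data.
from sacrebleu.metrics import CHRF

hypotheses = ["Madainn mhath!"]      # model outputs
references = [["Madainn mhath!"]]    # one reference stream, parallel to the hypotheses

chrf = CHRF()
print(chrf.corpus_score(hypotheses, references))  # e.g. chrF2 = 100.00
```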

## Usage

You can use this model to generate translations with the following code:

```python
from transformers import MarianMTModel, MarianTokenizer

# Load the pre-trained model and tokenizer
model_name = "Helsinki-NLP/opus-mt-synthetic-en-gd"
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)

# Example source texts (English)
source_texts = ["How are you?", "Good morning!", "What is your name?"]

# Tokenize the input texts
inputs = tokenizer(source_texts, return_tensors="pt", padding=True, truncation=True)

# Generate translations (passing the attention mask along with the input ids)
translated_ids = model.generate(**inputs)

# Decode the generated tokens to get the translated text
translated_texts = tokenizer.batch_decode(translated_ids, skip_special_tokens=True)

# Print the translations
for src, tgt in zip(source_texts, translated_texts):
    print(f"Source: {src} => Translated: {tgt}")
```

For the given English sentences, the output might look something like this:

```
Source: How are you? => Translated: Ciamar a tha thu?
Source: Good morning! => Translated: Madainn mhath
Source: What is your name? => Translated: Dè an t-ainm a th’ agad?
```
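
Alternatively, the same model should also work through the `transformers` translation pipeline, which wraps tokenization, generation, and decoding in a single call:

```python
# Convenience alternative to the explicit MarianMT usage above.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-synthetic-en-gd")
print(translator("Good morning!")[0]["translation_text"])
```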

## Citation Information

```bibtex
@article{degibert2025scaling,
  title={Scaling Low-Resource MT via Synthetic Data Generation with LLMs},
  author={de Gibert, Ona and Attieh, Joseph and Vahtola, Teemu and Aulamo, Mikko and Li, Zihao and V{\'a}zquez, Ra{\'u}l and Hu, Tiancheng and Tiedemann, J{\"o}rg},
  journal={arXiv preprint arXiv:2505.14423},
  year={2025}
}
```