---
language:
- en
- gd
metrics:
- chrf
- comet
tags:
- synthetic
- machine-translation
- low-resource
- data-augmentation
- nmt
- multilingual
- dataset
- transformer-base
license: cc-by-4.0
datasets:
- openlanguagedata/flores_plus
---

# Model Name: `Helsinki-NLP/opus-mt-synthetic-en-gd`

## Model Overview

This model is the synthetic baseline (transformer-base) for the English-Scottish Gaelic language pair from our paper ["Scaling Low-Resource MT via Synthetic Data Generation with LLMs"](https://arxiv.org/abs/2505.14423).
The training data was generated by forward-translating English Europarl with GPT-4o, with the specific aim of improving MT performance for underrepresented languages by supplementing traditional datasets with high-quality, LLM-generated translations.
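
To make the data-generation step concrete, below is a minimal, hypothetical sketch of forward translation with the OpenAI Python client. The prompt wording, decoding settings, and data handling are illustrative assumptions, not the exact pipeline used in the paper.

```python
# Hypothetical sketch of forward translation with GPT-4o (illustrative, not the paper's exact pipeline)
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def forward_translate(sentences, target_language="Scottish Gaelic"):
    """Translate English sentences into the target language with GPT-4o."""
    translations = []
    for sentence in sentences:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": f"Translate the user's English sentence into {target_language}. "
                            "Return only the translation."},
                {"role": "user", "content": sentence},
            ],
        )
        translations.append(response.choices[0].message.content.strip())
    return translations


# Each (English source, GPT-4o translation) pair becomes one line of synthetic parallel data.
english_sentences = ["The committee approved the proposal."]
synthetic_pairs = list(zip(english_sentences, forward_translate(english_sentences)))
```

In the paper, pairs of this kind, generated from English Europarl, form the synthetic parallel corpus on which this baseline is trained.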

The goal of this model is to provide a baseline for MT tasks, demonstrating the potential of synthetic data to enhance translation capabilities for languages with limited existing resources.

For more detailed methodology, see the full paper [here](https://arxiv.org/abs/2505.14423).

## Supported Language Pair

* **English ↔ Scottish Gaelic**

## Evaluation

The quality of the generated synthetic data was evaluated with both automatic metrics (chrF and COMET) and human evaluation. The evaluation shows that the synthetic data generally performs well for low-resource languages, with significant gains when it is used in downstream MT training.
Below are the evaluation results on FLORES+:

| Language Pair             | chrF  | COMET |
| ------------------------- | ----- | ----- |
| English ↔ Basque          | 53.00 | 81.51 |
| English ↔ Scottish Gaelic | 51.10 | 78.04 |
| English ↔ Icelandic       | 49.91 | 80.16 |
| English ↔ Georgian        | 49.49 | 80.72 |
| English ↔ Macedonian      | 57.72 | 82.24 |
| English ↔ Somali          | 45.10 | 78.15 |
| English ↔ Ukrainian       | 51.71 | 78.89 |

The results demonstrate that synthetic data provides strong baseline performance across all language pairs, with the best scores for Macedonian and Ukrainian, which are less resource-scarce than the other languages in the set.
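
To score new output in the same way, here is a minimal evaluation sketch assuming the `sacrebleu` package for chrF and Unbabel's `comet` package for COMET. The file names and the `Unbabel/wmt22-comet-da` checkpoint are assumptions, not necessarily the exact configuration used in the paper.

```python
# Hypothetical scoring sketch; file names and the COMET checkpoint are assumptions.
from sacrebleu.metrics import CHRF
from comet import download_model, load_from_checkpoint

sources = open("flores.devtest.en").read().splitlines()      # source sentences
hypotheses = open("hypotheses.gd").read().splitlines()       # model translations
references = open("flores.devtest.gd").read().splitlines()   # reference translations

# Corpus-level chrF
chrf = CHRF()
print("chrF:", chrf.corpus_score(hypotheses, [references]).score)

# Reference-based COMET (set gpus=1 if a GPU is available)
comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(sources, hypotheses, references)]
print("COMET:", comet_model.predict(data, batch_size=16, gpus=0).system_score)
```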

## Usage

You can use this model to generate translations with the following code:

```python
from transformers import MarianMTModel, MarianTokenizer

# Load the pre-trained model and tokenizer
model_name = "Helsinki-NLP/opus-mt-synthetic-en-gd"
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)

# Example source texts (English)
source_texts = ["How are you?", "Good morning!", "What is your name?"]

# Tokenize the input texts
inputs = tokenizer(source_texts, return_tensors="pt", padding=True, truncation=True)

# Generate translations (pass the attention mask along so padded batches decode correctly)
translated_ids = model.generate(**inputs)

# Decode the generated tokens to get the translated text
translated_texts = tokenizer.batch_decode(translated_ids, skip_special_tokens=True)

# Print the translations
for src, tgt in zip(source_texts, translated_texts):
    print(f"Source: {src} => Translated: {tgt}")
```

For the given English sentences, the output might look something like this:

```
Source: How are you? => Translated: Ciamar a tha thu?
Source: Good morning! => Translated: Madainn mhath
Source: What is your name? => Translated: Dè an t-ainm a th’ agad?
```
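
For quick experiments, the same checkpoint can also be run through the `transformers` translation pipeline, which wraps the tokenize-generate-decode steps above; this assumes the model config declares a translation task, as OPUS-MT Marian checkpoints typically do.

```python
from transformers import pipeline

# Convenience wrapper around tokenization, generation, and decoding
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-synthetic-en-gd")
outputs = translator(["How are you?", "Good morning!"])
print([o["translation_text"] for o in outputs])
```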

## Citation Information

```bibtex
@article{degibert2025scaling,
  title={Scaling Low-Resource MT via Synthetic Data Generation with LLMs},
  author={de Gibert, Ona and Attieh, Joseph and Vahtola, Teemu and Aulamo, Mikko and Li, Zihao and V{\'a}zquez, Ra{\'u}l and Hu, Tiancheng and Tiedemann, J{\"o}rg},
  journal={arXiv preprint arXiv:2505.14423},
  year={2025}
}
```