English ↔ Darija Translator
A bidirectional neural machine translation model for English and Moroccan Darija (the Moroccan Arabic dialect), built on a fine-tuned NLLB-200 transformer.
Project Overview
This project aims to bridge the communication gap between English and Moroccan Darija by providing an accurate, bidirectional translation system. The model is fine-tuned from the facebook/nllb-200-distilled-600M multilingual transformer and adapted specifically to the Darija dialect.
Use Cases
- Chatbots - Enable multilingual customer support
- Educational Applications - Language learning tools
- Cross-cultural Communication - Breaking language barriers
- Mobile Applications - Real-time translation services
Features
- Bidirectional Translation: English → Darija and Darija → English
- State-of-the-art Model: Fine-tuned NLLB-200 (600M parameters)
- High Performance: BLEU score ≥ 25-30 on test corpus
- Robust Preprocessing: Advanced text normalization and tokenization
- Easy to Use: Simple API for integration
Architecture
Corpus (English ↔ Darija)
↓
Preprocessing (Normalization, Tokenization)
↓
┌──────────────────┐
│ NLLB Transformer │
│ Encoder-Decoder  │
└──────────────────┘
↓
Translation Output
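The README does not spell out what the normalization step does. As a minimal sketch, assuming common cleanup rules for Arabic-script text (Unicode NFKC, removal of the tatweel elongation character and of short-vowel diacritics, whitespace collapsing), it could look like the following. The function name `normalize_text` and the choice of rules are assumptions for illustration, not the project's confirmed pipeline.

```python
import re
import unicodedata

# Hypothetical cleanup rules; the project's actual preprocessing may differ.
TATWEEL = "\u0640"                           # Arabic elongation character
DIACRITICS = re.compile("[\u064B-\u0652]")   # short-vowel marks (harakat)
WHITESPACE = re.compile(r"\s+")


def normalize_text(text: str) -> str:
    """Apply Unicode and whitespace normalization to one sentence."""
    text = unicodedata.normalize("NFKC", text)  # unify compatibility forms
    text = text.replace(TATWEEL, "")            # drop decorative elongation
    text = DIACRITICS.sub("", text)             # strip optional vowel marks
    return WHITESPACE.sub(" ", text).strip()    # collapse runs of whitespace
```

Subword tokenization would then be left to the NLLB tokenizer, which operates on the normalized string.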
Model Performance
| Metric | Score |
|---|---|
| BLEU | ≥ 25-30 |
| METEOR | High semantic alignment |
| chrF | Character-level accuracy |
| Evaluation Loss | 1.54 |
Quick Start
Prerequisites
- Python 3.8+
- PyTorch 2.0+
- CUDA-compatible GPU (recommended)
Usage
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# Load the model
model_name = "NeoAivara/English_to_Darija_translator"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
# Define language codes (source and target)
src_lang = "eng_Latn" # English
tgt_lang = "ary_Arab" # Moroccan Arabic (Darija)
# Example text
text = "Yesterday I went to the market with my sister to buy some vegetables and fruits. The prices were a bit high, but we found some good deals. After that, we stopped at a small café near the station to drink mint tea. It was sunny, and people were sitting outside, talking and laughing. I love those calm moments when everything feels simple and peaceful."
# Tokenize input
tokenizer.src_lang = src_lang # Set source language for tokenizer
inputs = tokenizer(text, return_tensors="pt")
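# Look up the target-language token ID; NLLB expects it as the first generated token
forced_bos_token_id = tokenizer.convert_tokens_to_ids(tgt_lang)
# Generate translation, passing the target language to generate() rather than storing it in the inputs
translated_tokens = model.generate(**inputs, forced_bos_token_id=forced_bos_token_id, max_length=200)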
translated_text = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
print(f"English: {text}")
print(f"Darija: {translated_text}")
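The example above covers only English to Darija. Since the model is described as bidirectional, the reverse direction should work by swapping the two language codes. The sketch below wraps the same calls in a reusable function. The helper names `load_translator` and `translate` are hypothetical, and translating from `ary_Arab` to `eng_Latn` by swapping codes is an assumption based on the feature list, not something the usage example demonstrates.

```python
def load_translator(model_name="NeoAivara/English_to_Darija_translator"):
    """Load tokenizer and model (imported lazily so the helper below stays lightweight)."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    return tokenizer, model


def translate(text, src_lang, tgt_lang, tokenizer, model, max_length=200):
    """Translate one string from src_lang to tgt_lang using NLLB language codes."""
    tokenizer.src_lang = src_lang                      # source language tag for the encoder input
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),  # target language tag
        max_length=max_length,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]


# Example usage (downloads the model on first call):
# tokenizer, model = load_translator()
# darija = translate("Where is the train station?", "eng_Latn", "ary_Arab", tokenizer, model)
# english = translate(darija, "ary_Arab", "eng_Latn", tokenizer, model)
```

Keeping the language codes as arguments means one function serves both directions, so no separate model or tokenizer instance is needed.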
Training
Training Configuration
training_args = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
    "warmup_steps": 500,
}
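To make the effect of `learning_rate` and `warmup_steps` concrete, here is a rough sketch of the learning rate over training. It assumes linear warmup followed by linear decay (the default scheduler type in the Transformers `Trainer`; the README does not say which scheduler was used), a single device, no gradient accumulation, and a hypothetical corpus of 40,000 sentence pairs, since the actual corpus size is not stated.

```python
import math

training_args = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": 8,
    "learning_rate": 2e-5,
    "warmup_steps": 500,
}

NUM_TRAIN_EXAMPLES = 40_000  # hypothetical; the README does not state the corpus size


def total_steps(num_examples: int, args: dict) -> int:
    """Number of optimizer steps for one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / args["per_device_train_batch_size"])
    return steps_per_epoch * args["num_train_epochs"]


def lr_at(step: int, num_steps: int, args: dict) -> float:
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    peak, warmup = args["learning_rate"], args["warmup_steps"]
    if step < warmup:
        return peak * step / max(1, warmup)
    return peak * max(0.0, (num_steps - step) / max(1, num_steps - warmup))
```

Calling `lr_at(step, total_steps(NUM_TRAIN_EXAMPLES, training_args), training_args)` for a few steps shows the rate rising to 2e-5 over the first 500 steps and then falling steadily toward zero by the final step.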
Evaluation
The model is evaluated using multiple metrics:
- BLEU: Measures translation quality
- METEOR: Semantic similarity
- chrF: Character n-gram F-score
- Qualitative Analysis: Native speaker review
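To illustrate what a character n-gram F-score measures, the following is a simplified sentence-level sketch of chrF: it averages character n-gram precision and recall over orders 1 to 6 and combines them with an F-beta score (beta = 2, weighting recall more heavily). It is an illustration only, ignores spaces, and is not the implementation used to produce this project's scores; reported numbers should come from an established tool such as sacreBLEU.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces (a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF in the range [0, 1] for one hypothesis/reference pair."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the texts is shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```

Because it works on characters rather than whole words, a score of this kind gives partial credit for near-miss spellings, which is useful for a dialect like Darija that has no standardized orthography.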
Technologies Used
- Python - Core programming language
- PyTorch - Deep learning framework
- Hugging Face Transformers - Pre-trained models and tokenizers
- FastAPI - API deployment
- SentencePiece - Subword tokenization
- scikit-learn - Data splitting and preprocessing
License
This project is licensed under the MIT License - see the LICENSE file for details.
Authors
- MOHAMMED EL KASSOIRI - GitHub Profile
Acknowledgments
- Meta AI for the NLLB-200 model
- Hugging Face for the Transformers library
- The Moroccan NLP community for dataset contributions
Citation
If you use this project in your research, please cite:
@misc{english-darija-translator,
  author = {MOHAMMED EL KASSOIRI},
  title = {English to Darija Translator},
  year = {2024},
  publisher = {GitHub},
  url = {https://github.com/Mohammed-El-Kassoiri/English-to-Darija-Translator}
}
If you find this project useful, please consider giving it a star on GitHub!