English ↔ Darija Translator 🇲🇦

A bidirectional neural machine translation model for English and Moroccan Darija (the Moroccan Arabic dialect), built on the NLLB-200 transformer architecture.

🎯 Project Overview

This project aims to bridge the communication gap between English and Moroccan Darija with an accurate, bidirectional translation system. The model is fine-tuned from facebook/nllb-200-distilled-600M, a multilingual transformer, and adapted specifically to the Darija dialect.

Use Cases

💬 Chatbots - Enable multilingual customer support
📚 Educational Applications - Language learning tools
🌍 Cross-cultural Communication - Breaking language barriers
📱 Mobile Applications - Real-time translation services

✨ Features

Bidirectional Translation: English → Darija and Darija → English
State-of-the-art Model: Fine-tuned NLLB-200 (600M parameters)
High Performance: BLEU score of 25-30 or higher on the test corpus
Robust Preprocessing: Advanced text normalization and tokenization
Easy to Use: Simple API for integration

🏗️ Architecture

Corpus (English ↔ Darija)
           ↓
Preprocessing (Normalization, Tokenization)
           ↓
   ┌──────────────────┐
   │  NLLB Transformer │
   │  Encoder-Decoder  │
   └──────────────────┘
           ↓
    Translation Output
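
The preprocessing stage in the diagram is not specified in detail here, so the following is only a minimal sketch of what text normalization for Arabic-script Darija might look like. The specific steps (NFKC normalization, removing diacritics and tatweel, collapsing whitespace) and the function name normalize_text are assumptions for illustration, not the project's documented pipeline.

```python
import re
import unicodedata

# Assumed normalization steps; the project's actual pipeline may differ.
_DIACRITICS = re.compile(r"[\u064B-\u0652]")  # Arabic diacritics (tashkeel)
_TATWEEL = "\u0640"                            # elongation character
_WHITESPACE = re.compile(r"\s+")


def normalize_text(text: str) -> str:
    """Return a normalized copy of `text` (hypothetical helper)."""
    # Unify compatibility forms, e.g. Arabic presentation forms.
    text = unicodedata.normalize("NFKC", text)
    # Drop optional vowel marks so spelling variants collapse together.
    text = _DIACRITICS.sub("", text)
    # Remove purely decorative elongation.
    text = text.replace(_TATWEEL, "")
    # Collapse runs of whitespace and trim the ends.
    return _WHITESPACE.sub(" ", text).strip()
```

Applying the same function to both sides of the corpus before tokenization keeps training and inference inputs consistent.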

📊 Model Performance

Metric	Score
BLEU	≥ 25-30
METEOR	High semantic alignment
chrF	Character-level accuracy
Evaluation Loss	1.54

🚀 Quick Start

Prerequisites

Python 3.8+
PyTorch 2.0+
CUDA-compatible GPU (recommended)

Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Model ID on the Hugging Face Hub
model_name = "NeoAivara/English_to_Darija_translator"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Define language codes (source and target)
src_lang = "eng_Latn"  # English
tgt_lang = "ary_Arab"  # Moroccan Arabic (Darija)

# Example text
text = "Yesterday I went to the market with my sister to buy some vegetables and fruits. The prices were a bit high, but we found some good deals. After that, we stopped at a small café near the station to drink mint tea. It was sunny, and people were sitting outside, talking and laughing. I love those calm moments when everything feels simple and peaceful."

# Tokenize input
tokenizer.src_lang = src_lang # Set source language for tokenizer
inputs = tokenizer(text, return_tensors="pt")

# Force the decoder to start with the target-language token
forced_bos_token_id = tokenizer.convert_tokens_to_ids(tgt_lang)

# Generate translation
translated_tokens = model.generate(**inputs, forced_bos_token_id=forced_bos_token_id, max_length=200)
translated_text = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

print(f"English: {text}")
print(f"Darija: {translated_text}")
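
The snippet above covers English → Darija only. Since the model is described as bidirectional, the reverse direction should only require swapping the two language codes. The sketch below wraps that in a helper; the direction labels ("en-ary", "ary-en") and the function names language_codes and translate are hypothetical, and the tokenizer and model arguments are assumed to be the objects loaded above.

```python
from typing import Tuple

# Language codes taken from the usage example above.
LANG_CODES = {"en": "eng_Latn", "ary": "ary_Arab"}


def language_codes(direction: str) -> Tuple[str, str]:
    """Map a direction label such as 'en-ary' to (src_lang, tgt_lang)."""
    try:
        src, tgt = direction.split("-")
        return LANG_CODES[src], LANG_CODES[tgt]
    except (ValueError, KeyError):
        raise ValueError(f"unsupported direction: {direction!r}") from None


def translate(text, direction, tokenizer, model, max_length=200):
    """Translate `text` in the given direction (hypothetical helper)."""
    src_lang, tgt_lang = language_codes(direction)
    tokenizer.src_lang = src_lang  # tells the tokenizer which source tag to add
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        # Force the decoder to begin with the target-language token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_length=max_length,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
```

With the objects from the example, translate(darija_text, "ary-en", tokenizer, model) would request Darija → English output, where darija_text is any Darija string of your own.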

🔧 Training

Training Configuration

training_args = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
    "warmup_steps": 500,
}
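
To make the learning_rate and warmup_steps values above more concrete, here is a small sketch of the learning rate over training, assuming linear warmup followed by linear decay (the default schedule in the Hugging Face Trainer). The scheduler actually used for this model is not stated here, and total_steps is a hypothetical value that depends on the dataset size.

```python
def lr_at_step(step: int, total_steps: int,
               base_lr: float = 2e-5, warmup_steps: int = 500) -> float:
    """Learning rate at `step` under linear warmup then linear decay (assumed schedule)."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr down to 0 at the final step.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    total = 10_000  # hypothetical number of optimizer steps
    for s in (0, 250, 500, 5_000, 10_000):
        print(s, lr_at_step(s, total))
```

The rate peaks at 2e-5 exactly when warmup ends (step 500) and falls to zero at the last step.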

📈 Evaluation

The model is evaluated using multiple metrics:

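BLEU: N-gram precision against reference translations, with a brevity penalty
METEOR: Unigram matching that weights recall more heavily than precision
chrF: Character n-gram F-score, which tolerates the spelling variation common in written Darija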
Qualitative Analysis: Native speaker review
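
To illustrate what a character n-gram F-score measures, here is a simplified, sentence-level sketch in plain Python. It is not the implementation used to produce the scores reported here, and it will not match a reference tool such as sacreBLEU exactly; the function names and the defaults (n-grams up to length 6, beta = 2, whitespace ignored) are assumptions chosen to resemble common chrF settings.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of length n, ignoring whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF in [0, 1]: F-beta over averaged n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to contain n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta > 1 weights recall more heavily than precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Because it works on characters rather than whole words, a hypothesis that differs from the reference by a single letter still receives partial credit.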

🛠️ Technologies Used

Python - Core programming language
PyTorch - Deep learning framework
Hugging Face Transformers - Pre-trained models and tokenizers
FastAPI - API deployment
SentencePiece - Subword tokenization
scikit-learn - Data splitting and preprocessing

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

👥 Authors

MOHAMMED EL KASSOIRI - GitHub Profile

🙏 Acknowledgments

Meta AI for the NLLB-200 model
Hugging Face for the Transformers library
The Moroccan NLP community for dataset contributions

📊 Citation

If you use this project in your research, please cite:

@misc{english-darija-translator,
  author = {MOHAMMED EL KASSOIRI},
  title = {English to Darija Translator},
  year = {2024},
  publisher = {GitHub},
  url = {https://github.com/Mohammed-El-Kassoiri/English-to-Darija-Translator}
}

⭐ If you find this project useful, please consider giving it a star on GitHub!
