English ↔ Darija Translator 🇲🇦

A bidirectional neural machine translation model for English and Moroccan Darija (the Moroccan Arabic dialect), built on a transformer encoder-decoder architecture.

🎯 Project Overview

This project aims to bridge the communication gap between English and Moroccan Darija with an accurate, bidirectional translation system. The model is fine-tuned from facebook/nllb-200-distilled-600M, a multilingual transformer, and adapted specifically to the Darija dialect.

Use Cases

  • 💬 Chatbots - Enable multilingual customer support
  • 📚 Educational Applications - Language learning tools
  • 🌍 Cross-cultural Communication - Breaking language barriers
  • 📱 Mobile Applications - Real-time translation services

✨ Features

  • Bidirectional Translation: English → Darija and Darija → English
  • State-of-the-art Model: Fine-tuned NLLB-200 (600M parameters)
  • High Performance: BLEU score ≥ 25-30 on test corpus
  • Robust Preprocessing: Advanced text normalization and tokenization
  • Easy to Use: Simple API for integration

🏗️ Architecture

Corpus (English ↔ Darija)
           ↓
Preprocessing (Normalization, Tokenization)
           ↓
   ┌───────────────────┐
   │  NLLB Transformer │
   │  Encoder-Decoder  │
   └───────────────────┘
           ↓
    Translation Output
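
The README does not list the exact normalization steps used in the preprocessing stage. As a rough sketch only, the snippet below shows steps that are commonly applied to Arabic-script text before tokenization: Unicode normalization, removal of the tatweel (elongation) character, removal of short-vowel diacritics, and whitespace cleanup. The function name `normalize_text` and the choice of steps are assumptions for illustration, not the project's confirmed pipeline.

```python
import re
import unicodedata

# Assumed normalization steps; the project's real pipeline may differ.
TATWEEL = "\u0640"                                  # Arabic elongation character
ARABIC_DIACRITICS = re.compile("[\u064B-\u0652]")   # fathatan through sukun
WHITESPACE = re.compile(r"\s+")


def normalize_text(text: str) -> str:
    """Hypothetical normalizer for English or Arabic-script Darija input."""
    # Fold compatibility characters into their canonical forms
    text = unicodedata.normalize("NFKC", text)
    # Drop decorative elongation and short-vowel marks
    text = text.replace(TATWEEL, "")
    text = ARABIC_DIACRITICS.sub("", text)
    # Collapse runs of whitespace and trim the ends
    return WHITESPACE.sub(" ", text).strip()
```

Whether stripping diacritics helps depends on the training data: if the corpus was left vocalized, inputs should be left vocalized as well.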

📊 Model Performance

Metric           Score
BLEU             ≥ 25-30
METEOR           High semantic alignment
chrF             Character-level accuracy
Evaluation Loss  1.54

🚀 Quick Start

Prerequisites

Python 3.8+
PyTorch 2.0+
CUDA-compatible GPU (recommended)

Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Model repository on the Hugging Face Hub
model_name = "NeoAivara/English_to_Darija_translator"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Define language codes (source and target)
src_lang = "eng_Latn"  # English
tgt_lang = "ary_Arab"  # Moroccan Arabic (Darija)

# Example text
text = "Yesterday I went to the market with my sister to buy some vegetables and fruits. The prices were a bit high, but we found some good deals. After that, we stopped at a small café near the station to drink mint tea. It was sunny, and people were sitting outside, talking and laughing. I love those calm moments when everything feels simple and peaceful."

# Tokenize input
tokenizer.src_lang = src_lang # Set source language for tokenizer
inputs = tokenizer(text, return_tensors="pt")

# Look up the token ID of the target language tag
forced_bos_token_id = tokenizer.convert_tokens_to_ids(tgt_lang)

# Generate translation, forcing the decoder to start with the target language tag
translated_tokens = model.generate(**inputs, forced_bos_token_id=forced_bos_token_id, max_length=200)
translated_text = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

print(f"English: {text}")
print(f"Darija: {translated_text}")
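
Since the model is described as bidirectional, the same steps can be wrapped in a small helper that takes the direction as arguments. This is a minimal sketch: the name `translate` is hypothetical, and it assumes the fine-tuned checkpoint keeps the standard NLLB language tags (`eng_Latn`, `ary_Arab`) for both directions.

```python
def translate(text, tokenizer, model, src_lang, tgt_lang, max_length=200):
    """Translate `text` with an NLLB-style tokenizer and seq2seq model.

    Hypothetical helper: `tokenizer` and `model` are the objects loaded
    with AutoTokenizer / AutoModelForSeq2SeqLM in the usage example.
    """
    # Tell the tokenizer which language tag to attach to the source text
    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt")

    # Force the decoder to begin with the target language tag
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_length=max_length,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```

With the tokenizer and model from the usage example, English to Darija would be `translate(text, tokenizer, model, "eng_Latn", "ary_Arab")`, and swapping the last two arguments gives the Darija to English direction. Inputs longer than `max_length` tokens may be cut off in the output, so long passages are probably better translated sentence by sentence.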

🔧 Training

Training Configuration

training_args = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
    "warmup_steps": 500,
}
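
To see what these values imply in practice, the sketch below works out the number of optimizer steps and the learning rate at a given step. It rests on several assumptions not stated in this README: a linear warmup followed by linear decay (the default schedule in the Hugging Face Trainer), a single device with no gradient accumulation, and a made-up corpus size of 40,000 sentence pairs. The helper names `total_steps` and `lr_at` are hypothetical.

```python
import math

# Values copied from the training configuration above
training_args = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": 8,
    "learning_rate": 2e-5,
    "warmup_steps": 500,
}


def total_steps(num_examples: int, args: dict) -> int:
    """Optimizer steps for the whole run (one device, no accumulation)."""
    steps_per_epoch = math.ceil(num_examples / args["per_device_train_batch_size"])
    return steps_per_epoch * args["num_train_epochs"]


def lr_at(step: int, total: int, args: dict) -> float:
    """Learning rate under an assumed linear warmup, then linear decay."""
    warmup = args["warmup_steps"]
    peak = args["learning_rate"]
    if step < warmup:
        return peak * step / warmup
    return peak * max(0.0, (total - step) / max(1, total - warmup))


NUM_EXAMPLES = 40_000  # hypothetical corpus size, not from the README
total = total_steps(NUM_EXAMPLES, training_args)
print(total, lr_at(250, total, training_args), lr_at(500, total, training_args))
```

Under these assumptions the 500 warmup steps cover 2% of the run, after which the rate falls linearly from 2e-5 to zero.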

📈 Evaluation

The model is evaluated using multiple metrics:

  • BLEU: N-gram precision against reference translations
  • METEOR: Semantic similarity
  • chrF: Character n-gram F-score
  • Qualitative Analysis: Native speaker review
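
To make the character n-gram F-score concrete, here is a simplified, sentence-level sketch in plain Python. It is not the reference chrF implementation and not necessarily how this project computed its scores: it ignores spaces, averages precision and recall over character n-grams of length 1 to 6, and combines them with beta = 2 so that recall counts more than precision. The name `simple_chrf` is hypothetical; for reported results an established tool such as sacreBLEU is the safer choice.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces (a simplifying assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF-style score in [0, 1] for a single sentence pair."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to have n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```

Because it works on characters rather than whole words, this kind of score gives partial credit for spelling variants, which is useful for a dialect like Darija that has no single standard orthography.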

🛠️ Technologies Used

  • Python - Core programming language
  • PyTorch - Deep learning framework
  • Hugging Face Transformers - Pre-trained models and tokenizers
  • FastAPI - API deployment
  • SentencePiece - Subword tokenization
  • scikit-learn - Data splitting and preprocessing

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

👥 Authors

🙏 Acknowledgments

  • Meta AI for the NLLB-200 model
  • Hugging Face for the Transformers library
  • The Moroccan NLP community for dataset contributions

📊 Citation

If you use this project in your research, please cite:

@misc{english-darija-translator,
  author = {MOHAMMED EL KASSOIRI},
  title = {English to Darija Translator},
  year = {2024},
  publisher = {GitHub},
  url = {https://github.com/Mohammed-El-Kassoiri/English-to-Darija-Translator}
}

⭐ If you find this project useful, please consider giving it a star on GitHub!
