---
license: apache-2.0
language:
- en
- zh
- ru
- uk
- cs
- ja
- ko
pipeline_tag: translation
---

# Marco-MT-Algharb

This repository contains Algharb, the system submitted by the Marco Translation Team of Alibaba International Digital Commerce (AIDC) to the WMT 2025 General Machine Translation Shared Task.

## Introduction

The Algharb system is a large translation model built on the Qwen3-14B foundation model. It is designed for high-quality translation across 13 diverse language directions and demonstrates state-of-the-art performance. Our approach centers on a multi-stage refinement pipeline that systematically improves translation fluency and faithfulness.

## Usage

The model expects a specific instruction format for translation. The following example shows how to construct the prompt and run generation with the vLLM library for efficient inference.

### 1. Dependencies

First, ensure you have the necessary libraries installed:

```bash
pip install torch transformers==4.55.0 vllm==0.10.0
```

### 2. Prompt Format and Decoding

The core of the process is to format the input text with a specific prompt template and then use the vLLM engine to generate translations. For our hybrid decoding strategy, we generate multiple candidates (n > 1) for later re-ranking.

The prompt template is:

```python
"Human: Please translate the following text into {target_language}: \n{source_text}<|im_end|>\nAssistant:"
```

Here is a complete Python example:
```python
from vllm import LLM, SamplingParams

# --- 1. Load Model (vLLM also loads the tokenizer) ---
model_path = "path/to/your/algharb_model"
llm = LLM(model=model_path)

# --- 2. Define Source Text and Target Language ---
source_text = "This paper presents the Algharb system, our submission to the WMT 2025."
source_lang_code = "en_XX" # Not used in prompt, for tracking
target_lang_code = "zh_CN"

# Helper dictionary to map language codes to full names for the prompt
lang_name_map = {
    "zh_CN": "chinese",
    "ko_KR": "korean",
    "ja_JP": "japanese",
    "ar_EG": "arabic",
    "cs_CZ": "czech",
    "ru_RU": "russian",
    "uk_UA": "ukraine",
    "et_EE": "estonian",
    "bho_IN": "bhojpuri",
    "sr_Latn_RS": "serbian",
    "de_DE": "german"
}

target_language_name = lang_name_map.get(target_lang_code, "the target language")

# --- 3. Construct the Prompt ---
prompt = (
    f"Human: Please translate the following text into {target_language_name}: \n"
    f"{source_text}<|im_end|>\n"
    f"Assistant:"
)

prompts_to_generate = [prompt]
print("Formatted Prompt:\n", prompt)

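# --- 4. Configure Sampling ---
# Draw n=100 candidates per prompt by unrestricted sampling, for later re-ranking.
sampling_params = SamplingParams(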
    n=100,
    temperature=1.0,
    top_p=1.0,
    max_tokens=512
)

# --- 5. Generate Translations ---
outputs = llm.generate(prompts_to_generate, sampling_params)

# --- 6. Process and Print Results ---
# The 'outputs' list contains one item for each prompt.
for output in outputs:
    prompt_used = output.prompt
    print(f"\n--- Candidates for source: '{source_text}' ---")
    
    # Each output object contains 'n' generated sequences.
    for i, candidate in enumerate(output.outputs):
        generated_text = candidate.text.strip()
        print(f"Candidate {i+1}: {generated_text}")
```

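### 3. Apply MBR Decoding

Rerank the sampled candidates with `comet-mbr`, which ships with the COMET toolkit (`pip install unbabel-comet`). The value of `--num_samples` must match the `n` used during sampling.
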
```bash
comet-mbr -s src.txt -t mbr_sample_100.txt -o mbr_trans.txt --num_samples 100 --gpus 1 --qe_model Unbabel/wmt22-cometkiwi-da
```
Note: Word alignment for MBR reranking will be available soon.
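
To make the reranking step concrete, here is a toy sketch of the MBR selection rule: each candidate is scored by its average utility against all other candidates, which act as pseudo-references, and the highest-scoring candidate is kept. The utility below is a whitespace token-overlap F1 used only as a stand-in (it is not suitable for unsegmented languages such as Chinese). The actual pipeline relies on COMET models via `comet-mbr`, and the function names `token_f1` and `mbr_select` are hypothetical.

```python
from collections import Counter
from typing import Callable, List


def token_f1(hypothesis: str, reference: str) -> float:
    """Toy utility: token-level F1 between two whitespace-tokenized strings."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates: List[str],
               utility: Callable[[str, str], float] = token_f1) -> str:
    """Return the candidate with the highest mean utility against the others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]
    best_index, best_score = 0, float("-inf")
    for i, hyp in enumerate(candidates):
        # Treat every other candidate as a pseudo-reference for this hypothesis.
        total = sum(utility(hyp, ref) for j, ref in enumerate(candidates) if j != i)
        score = total / (len(candidates) - 1)
        if score > best_score:
            best_index, best_score = i, score
    return candidates[best_index]


samples = ["the cat sat on the mat", "the cat sat on a mat", "a dog ran away"]
print(mbr_select(samples))
```

The pairwise comparison costs O(n²) utility calls per source sentence, which is one reason the number of samples is a key cost parameter in MBR decoding.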