---
license: apache-2.0
language:
- en
- zh
- ru
- uk
- cs
- ja
- ko
pipeline_tag: translation
---

# Marco-MT-Algharb

This repository contains the system for Algharb, the submission from the Marco Translation Team of Alibaba International Digital Commerce (AIDC) to the WMT 2025 General Machine Translation Shared Task.

## Introduction

The Algharb system is a large translation model built on the Qwen3-14B foundation model. It is designed for high-quality translation across 13 diverse language directions and demonstrates state-of-the-art performance. Our approach centers on a multi-stage refinement pipeline that systematically enhances translation fluency and faithfulness.

## Usage

The model expects a specific instruction format for translation. The following example shows how to construct the prompt and generate translations with the vLLM library for efficient inference.

### 1. Dependencies

First, ensure you have the necessary libraries installed:

```bash
pip install torch transformers==4.55.0 vllm==0.10.0
```

### 2. Prompt Format and Decoding

The core of the process is to format the input text with a specific prompt template and then use the vLLM engine to generate translations. For our hybrid decoding strategy, we generate multiple candidates (n > 1) for later re-ranking.

The prompt template is:

```python
"Human: Please translate the following text into {target_language}: \n{source_text}<|im_end|>\nAssistant:"
```

Here is a complete Python example:

```python
from vllm import LLM, SamplingParams

# --- 1. Load Model ---
model_path = "path/to/your/algharb_model"
llm = LLM(model=model_path)

# --- 2. Define Source Text and Target Language ---
source_text = "This paper presents the Algharb system, our submission to the WMT 2025."
source_lang_code = "en_XX"  # Not used in prompt, for tracking
target_lang_code = "zh_CN"

# Helper dictionary to map language codes to full names for the prompt
lang_name_map = {
    "zh_CN": "chinese",
    "ko_KR": "korean",
    "ja_JP": "japanese",
    "ar_EG": "arabic",
    "cs_CZ": "czech",
    "ru_RU": "russian",
    "uk_UA": "ukraine",
    "et_EE": "estonian",
    "bho_IN": "bhojpuri",
    "sr_Latn_RS": "serbian",
    "de_DE": "german"
}
target_language_name = lang_name_map.get(target_lang_code, "the target language")

# --- 3. Construct the Prompt ---
prompt = (
    f"Human: Please translate the following text into {target_language_name}: \n"
    f"{source_text}<|im_end|>\n"
    f"Assistant:"
)
prompts_to_generate = [prompt]
print("Formatted Prompt:\n", prompt)

# --- 4. Define Sampling Parameters ---
# n=100 samples per prompt, matching --num_samples 100 in the MBR step.
sampling_params = SamplingParams(
    n=100,
    temperature=1.0,
    top_p=1.0,
    max_tokens=512
)

# --- 5. Generate Translations ---
outputs = llm.generate(prompts_to_generate, sampling_params)

# --- 6. Process and Print Results ---
# The 'outputs' list contains one item for each prompt.
for output in outputs:
    print(f"\n--- Candidates for source: '{source_text}' ---")
    # Each output object contains 'n' generated sequences.
    for i, candidate in enumerate(output.outputs):
        generated_text = candidate.text.strip()
        print(f"Candidate {i+1}: {generated_text}")
```

### 3. Apply MBR decoding

Rerank the sampled candidates with the `comet-mbr` command from the COMET toolkit:

```bash
comet-mbr -s src.txt -t mbr_sample_100.txt -o mbr_trans.txt --num_samples 100 --gpus 1 --qe_model Unbabel/wmt22-cometkiwi-da
```

Note: Word alignment for MBR reranking will be available soon.
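
The sampling example prints its candidates, while the `comet-mbr` command above reads them from `src.txt` and `mbr_sample_100.txt`. The following is a minimal sketch of one way to bridge the two steps. It assumes that `comet-mbr` expects one source sentence per line and the candidates for each source on consecutive lines (`--num_samples` lines per source); please verify this layout against the COMET documentation. The helper name `write_mbr_inputs` is hypothetical and not part of this repository.

```python
from pathlib import Path
from typing import Sequence


def write_mbr_inputs(
    sources: Sequence[str],
    candidates: Sequence[Sequence[str]],
    src_path: str = "src.txt",
    samples_path: str = "mbr_sample_100.txt",
) -> int:
    """Write sources and candidates as line-aligned text files.

    Assumed layout (not confirmed by this repository): one source per line,
    and each source's candidates on consecutive lines of the samples file.
    Returns the number of candidates per source.
    """
    if len(sources) != len(candidates):
        raise ValueError("need exactly one candidate list per source")
    sizes = {len(group) for group in candidates}
    if len(sizes) != 1:
        raise ValueError("every source needs the same number of candidates")
    num_samples = sizes.pop()

    def one_line(text: str) -> str:
        # Collapse newlines and repeated whitespace so each item stays on
        # exactly one line and the two files remain aligned.
        return " ".join(text.split())

    Path(src_path).write_text(
        "".join(one_line(s) + "\n" for s in sources), encoding="utf-8"
    )
    Path(samples_path).write_text(
        "".join(one_line(c) + "\n" for group in candidates for c in group),
        encoding="utf-8",
    )
    return num_samples
```

With the variables from the generation example, a call such as `write_mbr_inputs([source_text], [[c.text for c in o.outputs] for o in outputs])` would produce both files. The returned count should match the value passed to `--num_samples`.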