
# RzenEmbed-v2-7B

RzenEmbed-v2-7B is a multimodal embedding model developed and open-sourced by 360CVGroup. It achieves state-of-the-art (SOTA) results on the MMEB-V2, MMEB-Visdoc, and MMEB-Video benchmarks (as of September 29, 2025).


## MMEB-V2

| Model | Model Size (B) | Overall | Image-Overall | Video-Overall | Visdoc-Overall |
| --- | --- | --- | --- | --- | --- |
| RzenEmbed-v2-7B | 8.29 | 71.61 | 75.92 | 55.73 | 77.06 |
| seed-1.6-embedding | unknown | 71.27 | 77.78 | 55.34 | 73.44 |
| Ops-MM-embedding-v1-7B | 8.29 | 67.61 | 72.72 | 53.76 | 70.34 |
| Ops-MM-embedding-v1-2B | 2.21 | 63.44 | 69.03 | 47.56 | 66.96 |
| interestFM-UIR-CAFe-7B | 8.03 | 60.63 | 67.56 | 42.40 | 63.92 |
| VLM2Vec-V2.0-Qwen2VL-2B | 2.21 | 58.02 | 64.85 | 34.85 | 65.36 |
| gme-Qwen2-VL-7B-Instruct | 8.29 | 57.83 | 55.95 | 38.43 | 75.18 |
| gme-Qwen2-VL-2B-Instruct | 2.21 | 54.08 | 51.89 | 33.64 | 72.71 |

## MMEB-Image

| Model | Model Size (B) | Image-Overall | I-CLS | I-QA | I-RET | I-VG |
| --- | --- | --- | --- | --- | --- | --- |
| seed-1.6-embedding | unknown | 77.78 | 76.06 | 73.97 | 77.90 | 91.25 |
| RzenEmbed-v2-7B | 8.29 | 75.92 | 70.61 | 71.67 | 78.50 | 92.10 |
| QQMM-embed-v2 | 8.29 | 75.28 | 72.97 | 71.85 | 76.01 | 87.42 |
| ReCo-7B | 8.29 | 73.87 | 70.95 | 71.52 | 73.66 | 87.70 |
| OEmbedding-v1-7B | 8.29 | 72.79 | 70.05 | 68.10 | 73.84 | 88.25 |
| Ops-MM-embedding-v1-7B | 8.29 | 72.72 | 69.65 | 69.58 | 73.09 | 87.15 |
| QQMM-embed | 8.29 | 72.18 | 70.07 | 69.52 | 71.18 | 87.08 |
| B3_Qwen2_7B | 8.29 | 72.00 | 70.00 | 66.50 | 74.10 | 84.60 |

## MMEB-Video

| Model | Model Size (B) | Video-Overall | V-CLS | V-QA | V-RET | V-MRET |
| --- | --- | --- | --- | --- | --- | --- |
| RzenEmbed-v2-7B | 8.29 | 55.73 | 58.82 | 63.50 | 50.97 | 45.54 |
| seed-1.6-embedding | unknown | 55.34 | 54.99 | 60.85 | 51.33 | 53.45 |
| Ops-MM-embedding-v1-7B | 8.29 | 53.76 | 59.68 | 62.22 | 45.72 | 43.21 |
| interestFM-UIR-CAFe-7B | 8.03 | 42.40 | 35.81 | 58.66 | 34.44 | 39.53 |
| gme-Qwen2-VL-7B-Instruct | 8.29 | 38.43 | 37.44 | 50.35 | 28.37 | 36.96 |
| interestFM-UIR-CAFe-0.5B | 0.89 | 35.87 | 33.90 | 41.72 | 29.69 | 39.69 |
| LamRA-Ret | 8.29 | 34.96 | 39.27 | 42.60 | 24.26 | 32.84 |
| VLM2Vec-V2.0-Qwen2VL-2B | 2.21 | 34.58 | 39.30 | 34.32 | 28.77 | 36.82 |

## MMEB-Visdoc

| Model | Model Size (B) | Visdoc-Overall | ViDoRe-V1 | ViDoRe-V2 | VisRAG | VisDoc-OOD |
| --- | --- | --- | --- | --- | --- | --- |
| RzenEmbed-v2-7B | 8.29 | 77.06 | 89.70 | 60.70 | 88.70 | 44.38 |
| gme-Qwen2-VL-7B-Instruct | 8.29 | 75.18 | 89.44 | 55.61 | 84.99 | 44.40 |
| seed-1.6-embedding | unknown | 73.44 | 85.53 | 56.57 | 84.74 | 43.14 |
| gme-Qwen2-VL-2B-Instruct | 2.21 | 72.71 | 86.15 | 53.96 | 82.52 | 43.12 |
| colpali-v1.3 | 2.92 | 70.97 | 83.60 | 51.98 | 81.13 | 43.12 |
| Ops-MM-embedding-v1-7B | 8.29 | 70.34 | 80.05 | 59.59 | 79.32 | 43.34 |
| Ops-MM-embedding-v1-2B | 2.21 | 66.96 | 76.39 | 53.18 | 77.64 | 41.17 |
| VLM2Vec-V2.0-Qwen2VL-2B | 2.21 | 65.36 | 75.52 | 44.86 | 79.38 | 39.43 |

## Usage

### Text-to-Image Retrieval

Retrieve images that match text captions.

```python
from rzen_embed_inference import RzenEmbed

rzen = RzenEmbed("qihoo360/RzenEmbed")

queries = [
    "A curious kitten and a gentle puppy share a moment of connection on the grass.",
    "Fresh fridge full of berries yogurt milk and snacks."
]
candidates = [
    "assets/example1.jpg",
    "assets/example2.jpg",
]

query_instruction = "Find me an everyday image that matches the given caption: "
candidate_instruction = "Represent the given image."

# Generate embeddings and compute similarity
query_embeds = rzen.get_fused_embeddings(instruction=query_instruction, texts=queries)
candidate_embeds = rzen.get_fused_embeddings(instruction=candidate_instruction, images=candidates)

# Calculate text-to-image similarity scores
similarity_scores = query_embeds @ candidate_embeds.T
print(similarity_scores)
```
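Turning the similarity matrix into a ranking is a one-liner. A minimal NumPy sketch, with synthetic vectors standing in for real model output (the normalization step is an assumption; it is a no-op if the model already returns unit-norm embeddings):

```python
import numpy as np

# Synthetic stand-ins for embeddings: 2 queries, 3 candidates, dim 4.
query_embeds = np.array([[1.0, 0.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0, 0.0]])
candidate_embeds = np.array([[0.9, 0.1, 0.0, 0.0],
                             [0.1, 0.9, 0.0, 0.0],
                             [0.0, 0.0, 1.0, 0.0]])

def l2_normalize(x):
    # Normalize rows so the dot product equals cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

similarity = l2_normalize(query_embeds) @ l2_normalize(candidate_embeds).T

# Index of the best-matching candidate for each query.
best = similarity.argmax(axis=1)
print(best)  # → [0 1]
```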

### Image-to-Text Retrieval

Find text captions that best match given images.

```python
from rzen_embed_inference import RzenEmbed

rzen = RzenEmbed("qihoo360/RzenEmbed")

queries = [
    "assets/example1.jpg",
    "assets/example2.jpg",
]
candidates = [
    "A curious kitten and a gentle puppy share a moment of connection on the grass.",
    "Fresh fridge full of berries yogurt milk and snacks."
]

query_instruction = "Find an image caption describing the given everyday image."

query_embeds = rzen.get_fused_embeddings(instruction=query_instruction, images=queries)
candidate_embeds = rzen.get_fused_embeddings(texts=candidates)

# Calculate image-to-text similarity scores
similarity_scores = query_embeds @ candidate_embeds.T
print(similarity_scores)
```
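When each image should return several candidate captions rather than a single best match, top-k selection over the score matrix suffices. A sketch with toy scores in place of real embeddings:

```python
import numpy as np

# Toy similarity scores: 2 image queries x 4 candidate captions.
similarity_scores = np.array([[0.2, 0.9, 0.1, 0.5],
                              [0.7, 0.3, 0.8, 0.4]])

k = 2
# Indices of the top-k captions per query, highest score first.
topk = np.argsort(-similarity_scores, axis=1)[:, :k]
print(topk)  # → [[1 3]
              #    [2 0]]
```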

### Document Retrieval

Match text queries with document images for information retrieval.

```python
from rzen_embed_inference import RzenEmbed

rzen = RzenEmbed("qihoo360/RzenEmbed")

queries = [
    "What is the main variable being analyzed on the x-axis of these graphs?",
    "What is the personnel costs in the 4th year?"
]
candidates = [
    "assets/example3.jpg",
    "assets/example4.jpg",
]

query_instruction = "Find a document image that matches the given query: "
candidate_instruction = "Understand the content of the provided document image."

# Generate embeddings for document retrieval
query_embeds = rzen.get_fused_embeddings(instruction=query_instruction, texts=queries)
candidate_embeds = rzen.get_fused_embeddings(instruction=candidate_instruction, images=candidates)

# Calculate text-to-document similarity
similarity_scores = query_embeds @ candidate_embeds.T
print(similarity_scores)
```

### Video Retrieval

Retrieve videos that match text captions; each video is represented by a set of uniformly sampled frames.

```python
import cv2
import numpy as np
from PIL import Image  # needed for Image.fromarray below
from rzen_embed_inference import RzenEmbed

def extract_frames(video_path, num_frames):
    """Uniformly sample num_frames frames from a video as RGB PIL images."""
    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frame_indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)
    frames = []
    for idx in frame_indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if ret:
            # OpenCV decodes as BGR; convert to RGB before wrapping in PIL.
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        else:
            break
    cap.release()
    return frames

rzen = RzenEmbed("qihoo360/RzenEmbed")

queries = [
    "A traditional boat glides along a river lined with blooming cherry blossoms under an overcast sky in a modern cityscape.",
    "Tiny ginger kitten meows cutely by the water."
]

# Extract frames from videos
video_path_list = [
    "assets/example5.mp4",
    "assets/example6.mp4",
]
candidates = [extract_frames(video_path, num_frames=8) for video_path in video_path_list]

query_instruction = "Find the video snippet that corresponds to the given caption: "
candidate_instruction = "Understand the content of the provided video."

# Generate embeddings for video retrieval
query_embeds = rzen.get_fused_embeddings(instruction=query_instruction, texts=queries)
candidate_embeds = rzen.get_fused_embeddings(instruction=candidate_instruction, images=candidates)

# Calculate text-to-video similarity scores
similarity_scores = query_embeds @ candidate_embeds.T
print(similarity_scores)
```
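The frame indices chosen by `extract_frames` are spread evenly across the clip. A close pure-Python equivalent (an illustrative helper, not part of the package; it rounds rather than truncates, so indices may differ by one from `np.linspace(..., dtype=int)`) is handy for checking which frames a given `num_frames` pulls:

```python
def uniform_frame_indices(total_frames, num_frames):
    """Evenly spaced frame indices from 0 to total_frames - 1, inclusive."""
    if num_frames == 1:
        return [0]
    step = (total_frames - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]

# 8 indices spanning frame 0 through frame 99 of a 100-frame clip.
print(uniform_frame_indices(100, 8))  # → [0, 14, 28, 42, 57, 71, 85, 99]
```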

## Citation

If you find RzenEmbed useful for your research and applications, please cite it using the following BibTeX entry:

```bibtex
@article{jian2025rzenembed,
  title={RzenEmbed: Towards Comprehensive Multimodal Retrieval},
  author={Jian, Weijian and Zhang, Yajun and Liang, Dawei and Xie, Chunyu and He, Yixiao and Leng, Dawei and Yin, Yuhui},
  journal={arXiv preprint arXiv:2510.27350},
  year={2025}
}
```