
# RzenEmbed-v2-7B

RzenEmbed-v2-7B is a multimodal embedding model developed and open-sourced by 360CVGroup. It achieves state-of-the-art (SOTA) results on the MMEB-V2, MMEB-Visdoc, and MMEB-Video benchmarks (as of September 29, 2025).


## MMEB-V2

| Model | Model Size (B) | Overall | Image-Overall | Video-Overall | Visdoc-Overall |
| --- | --- | --- | --- | --- | --- |
| RzenEmbed-v2-7B | 8.29 | 71.61 | 75.92 | 55.73 | 77.06 |
| seed-1.6-embedding | unknown | 71.27 | 77.78 | 55.34 | 73.44 |
| Ops-MM-embedding-v1-7B | 8.29 | 67.61 | 72.72 | 53.76 | 70.34 |
| Ops-MM-embedding-v1-2B | 2.21 | 63.44 | 69.03 | 47.56 | 66.96 |
| interestFM-UIR-CAFe-7B | 8.03 | 60.63 | 67.56 | 42.40 | 63.92 |
| VLM2Vec-V2.0-Qwen2VL-2B | 2.21 | 58.02 | 64.85 | 34.85 | 65.36 |
| gme-Qwen2-VL-7B-Instruct | 8.29 | 57.83 | 55.95 | 38.43 | 75.18 |
| gme-Qwen2-VL-2B-Instruct | 2.21 | 54.08 | 51.89 | 33.64 | 72.71 |

## MMEB-Image

| Model | Model Size (B) | Image-Overall | I-CLS | I-QA | I-RET | I-VG |
| --- | --- | --- | --- | --- | --- | --- |
| seed-1.6-embedding | unknown | 77.78 | 76.06 | 73.97 | 77.90 | 91.25 |
| RzenEmbed-v2-7B | 8.29 | 75.92 | 70.61 | 71.67 | 78.50 | 92.10 |
| QQMM-embed-v2 | 8.29 | 75.28 | 72.97 | 71.85 | 76.01 | 87.42 |
| ReCo-7B | 8.29 | 73.87 | 70.95 | 71.52 | 73.66 | 87.70 |
| OEmbedding-v1-7B | 8.29 | 72.79 | 70.05 | 68.10 | 73.84 | 88.25 |
| Ops-MM-embedding-v1-7B | 8.29 | 72.72 | 69.65 | 69.58 | 73.09 | 87.15 |
| QQMM-embed | 8.29 | 72.18 | 70.07 | 69.52 | 71.18 | 87.08 |
| B3_Qwen2_7B | 8.29 | 72.00 | 70.00 | 66.50 | 74.10 | 84.60 |

## MMEB-Video

| Model | Model Size (B) | Video-Overall | V-CLS | V-QA | V-RET | V-MRET |
| --- | --- | --- | --- | --- | --- | --- |
| RzenEmbed-v2-7B | 8.29 | 55.73 | 58.82 | 63.50 | 50.97 | 45.54 |
| seed-1.6-embedding | unknown | 55.34 | 54.99 | 60.85 | 51.33 | 53.45 |
| Ops-MM-embedding-v1-7B | 8.29 | 53.76 | 59.68 | 62.22 | 45.72 | 43.21 |
| interestFM-UIR-CAFe-7B | 8.03 | 42.40 | 35.81 | 58.66 | 34.44 | 39.53 |
| gme-Qwen2-VL-7B-Instruct | 8.29 | 38.43 | 37.44 | 50.35 | 28.37 | 36.96 |
| interestFM-UIR-CAFe-0.5B | 0.89 | 35.87 | 33.90 | 41.72 | 29.69 | 39.69 |
| LamRA-Ret | 8.29 | 34.96 | 39.27 | 42.60 | 24.26 | 32.84 |
| VLM2Vec-V2.0-Qwen2VL-2B | 2.21 | 34.58 | 39.30 | 34.32 | 28.77 | 36.82 |

## MMEB-Visdoc

| Model | Model Size (B) | Visdoc-Overall | ViDoRe-V1 | ViDoRe-V2 | VisRAG | VisDoc-OOD |
| --- | --- | --- | --- | --- | --- | --- |
| RzenEmbed-v2-7B | 8.29 | 77.06 | 89.70 | 60.70 | 88.70 | 44.38 |
| gme-Qwen2-VL-7B-Instruct | 8.29 | 75.18 | 89.44 | 55.61 | 84.99 | 44.40 |
| seed-1.6-embedding | unknown | 73.44 | 85.53 | 56.57 | 84.74 | 43.14 |
| gme-Qwen2-VL-2B-Instruct | 2.21 | 72.71 | 86.15 | 53.96 | 82.52 | 43.12 |
| colpali-v1.3 | 2.92 | 70.97 | 83.60 | 51.98 | 81.13 | 43.12 |
| Ops-MM-embedding-v1-7B | 8.29 | 70.34 | 80.05 | 59.59 | 79.32 | 43.34 |
| Ops-MM-embedding-v1-2B | 2.21 | 66.96 | 76.39 | 53.18 | 77.64 | 41.17 |
| VLM2Vec-V2.0-Qwen2VL-2B | 2.21 | 65.36 | 75.52 | 44.86 | 79.38 | 39.43 |

## Usage

### Text-to-Image Retrieval

Retrieve images that match text captions.

```python
from rzen_embed_inference import RzenEmbed

rzen = RzenEmbed("qihoo360/RzenEmbed")

queries = [
    "A curious kitten and a gentle puppy share a moment of connection on the grass.",
    "Fresh fridge full of berries yogurt milk and snacks."
]
candidates = [
    "assets/example1.jpg",
    "assets/example2.jpg",
]

query_instruction = "Find me an everyday image that matches the given caption: "
candidate_instruction = "Represent the given image."

# Generate embeddings and compute similarity
query_embeds = rzen.get_fused_embeddings(instruction=query_instruction, texts=queries)
candidate_embeds = rzen.get_fused_embeddings(instruction=candidate_instruction, images=candidates)

# Calculate text-to-image similarity scores
similarity_scores = query_embeds @ candidate_embeds.T
print(similarity_scores)
```
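Turning the similarity matrix into a ranking is a one-liner. A minimal NumPy sketch, with synthetic vectors standing in for real model output (the normalization step is an assumption; it is a no-op if the model already returns unit-norm embeddings):

```python
import numpy as np

# Synthetic stand-ins for embeddings: 2 queries, 3 candidates, dim 4.
query_embeds = np.array([[1.0, 0.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0, 0.0]])
candidate_embeds = np.array([[0.9, 0.1, 0.0, 0.0],
                             [0.1, 0.9, 0.0, 0.0],
                             [0.0, 0.0, 1.0, 0.0]])

def l2_normalize(x):
    # Normalize rows so the dot product equals cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

similarity = l2_normalize(query_embeds) @ l2_normalize(candidate_embeds).T

# Index of the best-matching candidate for each query.
best = similarity.argmax(axis=1)
print(best)  # → [0 1]
```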

### Image-to-Text Retrieval

Find text captions that best match given images.

```python
from rzen_embed_inference import RzenEmbed

rzen = RzenEmbed("qihoo360/RzenEmbed")

queries = [
    "assets/example1.jpg",
    "assets/example2.jpg",
]
candidates = [
    "A curious kitten and a gentle puppy share a moment of connection on the grass.",
    "Fresh fridge full of berries yogurt milk and snacks."
]

query_instruction = "Find an image caption describing the given everyday image."

query_embeds = rzen.get_fused_embeddings(instruction=query_instruction, images=queries)
candidate_embeds = rzen.get_fused_embeddings(texts=candidates)

# Calculate image-to-text similarity scores
similarity_scores = query_embeds @ candidate_embeds.T
print(similarity_scores)
```
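When each image should return several candidate captions rather than a single best match, top-k selection over the score matrix suffices. A sketch with toy scores in place of real embeddings:

```python
import numpy as np

# Toy similarity scores: 2 image queries x 4 candidate captions.
similarity_scores = np.array([[0.2, 0.9, 0.1, 0.5],
                              [0.7, 0.3, 0.8, 0.4]])

k = 2
# Indices of the top-k captions per query, highest score first.
topk = np.argsort(-similarity_scores, axis=1)[:, :k]
print(topk)  # → [[1 3]
              #    [2 0]]
```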

### Document Retrieval

Match text queries with document images for information retrieval.

```python
from rzen_embed_inference import RzenEmbed

rzen = RzenEmbed("qihoo360/RzenEmbed")

queries = [
    "What is the main variable being analyzed on the x-axis of these graphs?",
    "What is the personnel costs in the 4th year?"
]
candidates = [
    "assets/example3.jpg",
    "assets/example4.jpg",
]

query_instruction = "Find a document image that matches the given query: "
candidate_instruction = "Understand the content of the provided document image."

# Generate embeddings for document retrieval
query_embeds = rzen.get_fused_embeddings(instruction=query_instruction, texts=queries)
candidate_embeds = rzen.get_fused_embeddings(instruction=candidate_instruction, images=candidates)

# Calculate text-to-document similarity
similarity_scores = query_embeds @ candidate_embeds.T
print(similarity_scores)
```

### Video Retrieval

Retrieve videos that match text captions; each video is represented by a set of uniformly sampled frames.

```python
import cv2
import numpy as np
from PIL import Image  # needed for Image.fromarray below
from rzen_embed_inference import RzenEmbed

def extract_frames(video_path, num_frames):
    """Uniformly sample num_frames frames from a video as RGB PIL images."""
    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frame_indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)
    frames = []
    for idx in frame_indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if ret:
            # OpenCV decodes as BGR; convert to RGB before wrapping in PIL.
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        else:
            break
    cap.release()
    return frames

rzen = RzenEmbed("qihoo360/RzenEmbed")

queries = [
    "A traditional boat glides along a river lined with blooming cherry blossoms under an overcast sky in a modern cityscape.",
    "Tiny ginger kitten meows cutely by the water."
]

# Extract frames from videos
video_path_list = [
    "assets/example5.mp4",
    "assets/example6.mp4",
]
candidates = [extract_frames(video_path, num_frames=8) for video_path in video_path_list]

query_instruction = "Find the video snippet that corresponds to the given caption: "
candidate_instruction = "Understand the content of the provided video."

# Generate embeddings for video retrieval
query_embeds = rzen.get_fused_embeddings(instruction=query_instruction, texts=queries)
candidate_embeds = rzen.get_fused_embeddings(instruction=candidate_instruction, images=candidates)

# Calculate text-to-video similarity scores
similarity_scores = query_embeds @ candidate_embeds.T
print(similarity_scores)
```
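The frame indices chosen by `extract_frames` are spread evenly across the clip. A close pure-Python equivalent (an illustrative helper, not part of the package; it rounds rather than truncates, so indices may differ by one from `np.linspace(..., dtype=int)`) is handy for checking which frames a given `num_frames` pulls:

```python
def uniform_frame_indices(total_frames, num_frames):
    """Evenly spaced frame indices from 0 to total_frames - 1, inclusive."""
    if num_frames == 1:
        return [0]
    step = (total_frames - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]

# 8 indices spanning frame 0 through frame 99 of a 100-frame clip.
print(uniform_frame_indices(100, 8))  # → [0, 14, 28, 42, 57, 71, 85, 99]
```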

## Citation

If you find RzenEmbed useful for your research and applications, please cite it using the following BibTeX entry:

```bibtex
@article{jian2025rzenembed,
  title={RzenEmbed: Towards Comprehensive Multimodal Retrieval},
  author={Jian, Weijian and Zhang, Yajun and Liang, Dawei and Xie, Chunyu and He, Yixiao and Leng, Dawei and Yin, Yuhui},
  journal={arXiv preprint arXiv:2510.27350},
  year={2025}
}
```