BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration

Zhaoyang Li¹,², Dongjun Qian², Kai Su²*, Qishuai Diao², Xiangyang Xia², Chang Liu², Wenfei Yang¹, Tianzhu Zhang¹*, Zehuan Yuan²

¹University of Science and Technology of China ²ByteDance
*Corresponding Author

📖 Overview

BindWeave is a unified subject-consistent video generation framework for single- and multi-subject prompts, built on an MLLM-DiT architecture that couples a pretrained multimodal large language model (MLLM) with a diffusion transformer (DiT). It achieves cross-modal integration via entity grounding and representation alignment: the MLLM parses complex prompts and produces subject-aware hidden states, which condition the DiT for high-fidelity generation. For more details and tutorials, refer to ByteDance/BindWeave.

OpenS2V-Eval Performance 🏆

BindWeave achieves a TotalScore of 57.61% on the OpenS2V-Eval benchmark, the highest in the table below, along with the highest NexusScore (46.84%) and the second-highest FaceSim (53.71%). It remains competitive on the other evaluation dimensions against leading open-source and commercial systems.

Model	TotalScore↑	AestheticScore↑	MotionSmoothness↑	MotionAmplitude↑	FaceSim↑	GmeScore↑	NexusScore↑	NaturalScore↑
BindWeave	57.61%	45.55%	95.90%	13.91%	53.71%	67.79%	46.84%	66.85%
VACE-14B	57.55%	47.21%	94.97%	15.02%	55.09%	67.27%	44.08%	67.04%
Phantom-14B	56.77%	46.39%	96.31%	33.42%	51.46%	70.65%	37.43%	69.35%
Kling1.6(20250503)	56.23%	44.59%	86.93%	41.60%	40.10%	66.20%	45.89%	74.59%
Phantom-1.3B	54.89%	46.67%	93.30%	14.29%	48.56%	69.43%	42.48%	62.50%
MAGREF-480P	52.51%	45.02%	93.17%	21.81%	30.83%	70.47%	43.04%	66.90%
SkyReels-A2-P14B	52.25%	39.41%	87.93%	25.60%	45.95%	64.54%	43.75%	60.32%
Vidu2.0(20250503)	51.95%	41.48%	90.45%	13.52%	35.11%	67.57%	43.37%	65.88%
Pika2.1(20250503)	51.88%	46.88%	87.06%	24.71%	30.38%	69.19%	45.40%	63.32%
VACE-1.3B	49.89%	48.24%	97.20%	18.83%	20.57%	71.26%	37.91%	65.46%
VACE-P1.3B	48.98%	47.34%	96.80%	12.03%	16.59%	71.38%	40.19%	64.31%
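
For readers who want to compare the models metric by metric, the following is a minimal Python sketch that hard-codes the scores from the table above and reports the top model per column and BindWeave's rank in each. The helper names `best_per_metric` and `rank_of` are illustrative only and are not part of the BindWeave or OpenS2V-Eval code; the sketch also assumes, following the ↑ markers in the header, that higher is better for every column.

```python
# Scores (in percent) copied from the table above; column order follows its header.
METRICS = [
    "TotalScore", "AestheticScore", "MotionSmoothness", "MotionAmplitude",
    "FaceSim", "GmeScore", "NexusScore", "NaturalScore",
]

SCORES = {
    "BindWeave":          [57.61, 45.55, 95.90, 13.91, 53.71, 67.79, 46.84, 66.85],
    "VACE-14B":           [57.55, 47.21, 94.97, 15.02, 55.09, 67.27, 44.08, 67.04],
    "Phantom-14B":        [56.77, 46.39, 96.31, 33.42, 51.46, 70.65, 37.43, 69.35],
    "Kling1.6(20250503)": [56.23, 44.59, 86.93, 41.60, 40.10, 66.20, 45.89, 74.59],
    "Phantom-1.3B":       [54.89, 46.67, 93.30, 14.29, 48.56, 69.43, 42.48, 62.50],
    "MAGREF-480P":        [52.51, 45.02, 93.17, 21.81, 30.83, 70.47, 43.04, 66.90],
    "SkyReels-A2-P14B":   [52.25, 39.41, 87.93, 25.60, 45.95, 64.54, 43.75, 60.32],
    "Vidu2.0(20250503)":  [51.95, 41.48, 90.45, 13.52, 35.11, 67.57, 43.37, 65.88],
    "Pika2.1(20250503)":  [51.88, 46.88, 87.06, 24.71, 30.38, 69.19, 45.40, 63.32],
    "VACE-1.3B":          [49.89, 48.24, 97.20, 18.83, 20.57, 71.26, 37.91, 65.46],
    "VACE-P1.3B":         [48.98, 47.34, 96.80, 12.03, 16.59, 71.38, 40.19, 64.31],
}


def best_per_metric(scores, metrics):
    """Return {metric: model with the highest score on that metric}."""
    return {
        metric: max(scores, key=lambda name: scores[name][i])
        for i, metric in enumerate(metrics)
    }


def rank_of(model, scores, metrics):
    """Return {metric: 1-based rank of `model`}, where rank 1 is the highest score."""
    ranks = {}
    for i, metric in enumerate(metrics):
        ordered = sorted(scores, key=lambda name: scores[name][i], reverse=True)
        ranks[metric] = ordered.index(model) + 1
    return ranks


if __name__ == "__main__":
    leaders = best_per_metric(SCORES, METRICS)
    ranks = rank_of("BindWeave", SCORES, METRICS)
    for metric in METRICS:
        print(f"{metric:18s} best: {leaders[metric]:20s} BindWeave rank: {ranks[metric]}")
```

A per-column view like this shows where a high TotalScore comes from: the same model does not lead every column, so the aggregate reflects a trade-off across dimensions.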

BibTeX

@article{li2025bindweave,
  title={BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration},
  author={Li, Zhaoyang and Qian, Dongjun and Su, Kai and Diao, Qishuai and Xia, Xiangyang and Liu, Chang and Yang, Wenfei and Zhang, Tianzhu and Yuan, Zehuan},
  journal={arXiv preprint arXiv:2510.00438},
  year={2025}
}
