BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration

Zhaoyang Li 1,2, Dongjun Qian 2, Kai Su 2*, Qishuai Diao 2, Xiangyang Xia 2, Chang Liu 2, Wenfei Yang 1, Tianzhu Zhang 1*, Zehuan Yuan 2

1 University of Science and Technology of China, 2 ByteDance
*Corresponding Author

📖 Overview

BindWeave is a unified subject-consistent video generation framework for single- and multi-subject prompts, built on an MLLM-DiT architecture that couples a pretrained multimodal large language model (MLLM) with a diffusion transformer (DiT). It achieves cross-modal integration via entity grounding and representation alignment: the MLLM parses complex prompts and produces subject-aware hidden states, which then condition the DiT for high-fidelity generation. For more details or tutorials, refer to ByteDance/BindWeave.
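To make the idea of "hidden states that condition the DiT" concrete, here is a minimal toy sketch of one common conditioning mechanism, cross-attention, in plain Python. This is an assumption for illustration only: the overview above does not say how BindWeave injects the MLLM states, and the names `cross_attend`, `video_tokens`, and `subject_states` are hypothetical. The sketch also omits the learned query/key/value projections a real model would have.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cross_attend(video_tokens, subject_states):
    """Toy cross-attention (illustrative assumption, not BindWeave's code).

    Each video token acts as a query over the conditioning states, which
    serve as both keys and values. The attended mix is added back to the
    token as a residual.
    """
    dim = len(subject_states[0])
    scale = 1.0 / math.sqrt(dim)
    conditioned = []
    for q in video_tokens:
        weights = softmax([dot(q, k) * scale for k in subject_states])
        mixed = [
            sum(w * v[i] for w, v in zip(weights, subject_states))
            for i in range(dim)
        ]
        conditioned.append([qi + mi for qi, mi in zip(q, mixed)])
    return conditioned

# Two video tokens conditioned on two (identical) subject-aware states.
video_tokens = [[0.0, 0.0], [1.0, 0.0]]
subject_states = [[1.0, 2.0], [1.0, 2.0]]
conditioned = cross_attend(video_tokens, subject_states)
print(conditioned)
```

In this toy setting every video token receives the same subject information because the two conditioning states are identical; with distinct states, the attention weights would decide which subject each token draws on.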

OpenS2V-Eval Performance 🏆

BindWeave achieves a TotalScore of 57.61% on the OpenS2V-Eval benchmark, the highest among the models listed below and narrowly ahead of VACE-14B (57.55%). It also leads on NexusScore (46.84%) and ranks second on FaceSim (53.71%), demonstrating competitive performance against several leading open-source and commercial systems.

| Model | TotalScore↑ | AestheticScore↑ | MotionSmoothness↑ | MotionAmplitude↑ | FaceSim↑ | GmeScore↑ | NexusScore↑ | NaturalScore↑ |
|---|---|---|---|---|---|---|---|---|
| BindWeave | 57.61% | 45.55% | 95.90% | 13.91% | 53.71% | 67.79% | 46.84% | 66.85% |
| VACE-14B | 57.55% | 47.21% | 94.97% | 15.02% | 55.09% | 67.27% | 44.08% | 67.04% |
| Phantom-14B | 56.77% | 46.39% | 96.31% | 33.42% | 51.46% | 70.65% | 37.43% | 69.35% |
| Kling1.6(20250503) | 56.23% | 44.59% | 86.93% | 41.6% | 40.1% | 66.2% | 45.89% | 74.59% |
| Phantom-1.3B | 54.89% | 46.67% | 93.3% | 14.29% | 48.56% | 69.43% | 42.48% | 62.5% |
| MAGREF-480P | 52.51% | 45.02% | 93.17% | 21.81% | 30.83% | 70.47% | 43.04% | 66.9% |
| SkyReels-A2-P14B | 52.25% | 39.41% | 87.93% | 25.6% | 45.95% | 64.54% | 43.75% | 60.32% |
| Vidu2.0(20250503) | 51.95% | 41.48% | 90.45% | 13.52% | 35.11% | 67.57% | 43.37% | 65.88% |
| Pika2.1(20250503) | 51.88% | 46.88% | 87.06% | 24.71% | 30.38% | 69.19% | 45.4% | 63.32% |
| VACE-1.3B | 49.89% | 48.24% | 97.2% | 18.83% | 20.57% | 71.26% | 37.91% | 65.46% |
| VACE-P1.3B | 48.98% | 47.34% | 96.8% | 12.03% | 16.59% | 71.38% | 40.19% | 64.31% |
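To see where BindWeave stands on each individual dimension, the scores in the table above can be ranked per metric. The following is a small sketch that copies the table values into a dictionary and computes a 1-based rank (higher is better). The helper name `rank_of` is hypothetical, and the snippet only re-reads the published numbers; it does not reproduce how OpenS2V-Eval computes TotalScore.

```python
METRICS = [
    "TotalScore", "AestheticScore", "MotionSmoothness", "MotionAmplitude",
    "FaceSim", "GmeScore", "NexusScore", "NaturalScore",
]

# Values (in percent) transcribed from the table above, in METRICS order.
SCORES = {
    "BindWeave":         [57.61, 45.55, 95.90, 13.91, 53.71, 67.79, 46.84, 66.85],
    "VACE-14B":          [57.55, 47.21, 94.97, 15.02, 55.09, 67.27, 44.08, 67.04],
    "Phantom-14B":       [56.77, 46.39, 96.31, 33.42, 51.46, 70.65, 37.43, 69.35],
    "Kling1.6":          [56.23, 44.59, 86.93, 41.60, 40.10, 66.20, 45.89, 74.59],
    "Phantom-1.3B":      [54.89, 46.67, 93.30, 14.29, 48.56, 69.43, 42.48, 62.50],
    "MAGREF-480P":       [52.51, 45.02, 93.17, 21.81, 30.83, 70.47, 43.04, 66.90],
    "SkyReels-A2-P14B":  [52.25, 39.41, 87.93, 25.60, 45.95, 64.54, 43.75, 60.32],
    "Vidu2.0":           [51.95, 41.48, 90.45, 13.52, 35.11, 67.57, 43.37, 65.88],
    "Pika2.1":           [51.88, 46.88, 87.06, 24.71, 30.38, 69.19, 45.40, 63.32],
    "VACE-1.3B":         [49.89, 48.24, 97.20, 18.83, 20.57, 71.26, 37.91, 65.46],
    "VACE-P1.3B":        [48.98, 47.34, 96.80, 12.03, 16.59, 71.38, 40.19, 64.31],
}

def rank_of(model, metric):
    """Return the 1-based rank of `model` on `metric` (higher score is better)."""
    col = METRICS.index(metric)
    target = SCORES[model][col]
    # Rank = 1 + number of models with a strictly higher score.
    return 1 + sum(1 for row in SCORES.values() if row[col] > target)

ranks = {metric: rank_of("BindWeave", metric) for metric in METRICS}
for metric, rank in ranks.items():
    print(f"{metric:<17} rank {rank} of {len(SCORES)}")
```

Ranking per metric rather than only by TotalScore makes trade-offs visible: a model can lead overall while sitting mid-table on individual dimensions.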

BibTeX

@article{li2025bindweave,
  title={BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration},
  author={Li, Zhaoyang and Qian, Dongjun and Su, Kai and Diao, Qishuai and Xia, Xiangyang and Liu, Chang and Yang, Wenfei and Zhang, Tianzhu and Yuan, Zehuan},
  journal={arXiv preprint arXiv:2510.00438},
  year={2025}
}