Is the Qwen3-VL Vision Encoder Based on siglip2-so400m-patch16?

#1
by JosephusCheung - opened

Great to see the massive leap forward with Qwen3-VL. I noticed that the vision encoder's architectural profile is remarkably similar to Google's siglip2-so400m-patch16, which is a major change from your previous in-house encoder.
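
The resemblance is straightforward to check from the published configs. Here is a minimal sketch; the Hub repo ids below (the naflex SigLIP2 checkpoint, one of the public Qwen3-VL checkpoints) are assumptions, as is the `vision_config` attribute on the Qwen3-VL config (the Qwen2-VL family exposes it this way). Field names that don't exist in a given config simply print as None.

```python
# Minimal sketch: compare the vision-tower hyperparameters of a public
# Qwen3-VL checkpoint against siglip2-so400m-patch16. Repo ids and the
# `vision_config` attribute are assumptions; requires a transformers
# version that supports both model families.
from transformers import AutoConfig

siglip = AutoConfig.from_pretrained("google/siglip2-so400m-patch16-naflex").vision_config
qwen = AutoConfig.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Instruct").vision_config

# Field names differ between implementations; None just means the
# attribute doesn't exist under that name in one of the configs.
for key in ("hidden_size", "num_hidden_layers", "num_attention_heads",
            "intermediate_size", "patch_size"):
    print(f"{key}: siglip2={getattr(siglip, key, None)}  qwen3-vl={getattr(qwen, key, None)}")
```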

Was there a specific reason to omit this piece of information from the model card?

If I may offer a speculation: they may have done some work on the vision encoder that they decided not to reveal via the model card.

However, the siglip2-so400m-patch16 model is licensed under Apache 2.0, which requires the preservation of the original copyright and attribution notices.

And it was mentioned in the Qwen3-Omni technical report:

[screenshot of the relevant passage from the Qwen3-Omni technical report]

Qwen org

The initial ViT is initialized based on siglip2-so400m-patch16, and we have also made some modifications to the vision encoder, which will be detailed in the paper.
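
For readers curious what "initialized based on" typically means mechanically, here is a minimal sketch of warm-starting an encoder from the public SigLIP2 weights. This is not Qwen's actual training code: the checkpoint id and the hypothetical target encoder `my_vit` are assumptions, and any modified layers would be handled separately.

```python
# Hedged sketch of warm-starting a ViT from SigLIP2 weights; this is NOT
# Qwen's training code, just an illustration of the general recipe.
import torch
from transformers import Siglip2VisionModel

# Load the public SigLIP2 vision tower (checkpoint id is an assumption;
# the fixed-resolution variants ship under slightly different names).
donor = Siglip2VisionModel.from_pretrained(
    "google/siglip2-so400m-patch16-naflex", torch_dtype=torch.bfloat16
)
donor_state = donor.state_dict()

# `my_vit` stands in for whatever modified encoder is being trained
# (hypothetical). Copy every tensor whose name and shape line up, then
# continue pretraining; strict=False leaves the modified pieces untouched.
# target_state = my_vit.state_dict()
# compatible = {k: v for k, v in donor_state.items()
#               if k in target_state and v.shape == target_state[k].shape}
# my_vit.load_state_dict(compatible, strict=False)

print(f"{sum(v.numel() for v in donor_state.values()) / 1e6:.0f}M donor parameters")
```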

May I ask whether the vision encoder (VE) of Qwen2.5-VL was trained from scratch? And, if I may ask again, is SigLIP being used as the VE because the in-house encoder was not yet fully trained?
