Is the Qwen3-VL Vision Encoder Based on siglip2-so400m-patch16?

#1
by JosephusCheung - opened

Great to see the massive leap forward with Qwen3-VL. I noticed that the vision encoder's architectural profile is remarkably similar to Google's siglip2-so400m-patch16, which is a major change from your previous in-house encoder.
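
The resemblance is straightforward to check from the published configs. Here is a minimal sketch; the Hub repo ids below (the naflex SigLIP2 checkpoint, one of the public Qwen3-VL checkpoints) are assumptions, as is the `vision_config` attribute on the Qwen3-VL config (the Qwen2-VL family exposes it this way). Field names that don't exist in a given config simply print as None.

```python
# Minimal sketch: compare the vision-tower hyperparameters of a public
# Qwen3-VL checkpoint against siglip2-so400m-patch16. Repo ids and the
# `vision_config` attribute are assumptions; requires a transformers
# version that supports both model families.
from transformers import AutoConfig

siglip = AutoConfig.from_pretrained("google/siglip2-so400m-patch16-naflex").vision_config
qwen = AutoConfig.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Instruct").vision_config

# Field names differ between implementations; None just means the
# attribute doesn't exist under that name in one of the configs.
for key in ("hidden_size", "num_hidden_layers", "num_attention_heads",
            "intermediate_size", "patch_size"):
    print(f"{key}: siglip2={getattr(siglip, key, None)}  qwen3-vl={getattr(qwen, key, None)}")
```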

Was there a specific reason to omit this piece of information from the model card?

If I may offer a speculation: they may have done some work on the vision encoder that they decided not to reveal via the model card.

However, the siglip2-so400m-patch16 model is licensed under Apache 2.0, which requires the preservation of the original copyright and attribution notices.

And it was mentioned in the Qwen3-Omni technical report:

[screenshot of the relevant passage from the Qwen3-Omni technical report]

Qwen org

The initial ViT is initialized based on siglip2-so400m-patch16, and we have also made some modifications to the vision encoder, which will be detailed in the paper.
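
For readers curious what "initialized based on" typically means mechanically, here is a minimal sketch of warm-starting an encoder from the public SigLIP2 weights. This is not Qwen's actual training code: the checkpoint id and the hypothetical target encoder `my_vit` are assumptions, and any modified layers would be handled separately.

```python
# Hedged sketch of warm-starting a ViT from SigLIP2 weights; this is NOT
# Qwen's training code, just an illustration of the general recipe.
import torch
from transformers import Siglip2VisionModel

# Load the public SigLIP2 vision tower (checkpoint id is an assumption;
# the fixed-resolution variants ship under slightly different names).
donor = Siglip2VisionModel.from_pretrained(
    "google/siglip2-so400m-patch16-naflex", torch_dtype=torch.bfloat16
)
donor_state = donor.state_dict()

# `my_vit` stands in for whatever modified encoder is being trained
# (hypothetical). Copy every tensor whose name and shape line up, then
# continue pretraining; strict=False leaves the modified pieces untouched.
# target_state = my_vit.state_dict()
# compatible = {k: v for k, v in donor_state.items()
#               if k in target_state and v.shape == target_state[k].shape}
# my_vit.load_state_dict(compatible, strict=False)

print(f"{sum(v.numel() for v in donor_state.values()) / 1e6:.0f}M donor parameters")
```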

May I ask whether the vision encoder (VE) of Qwen2.5-VL was trained from scratch? And, if I may ask again, is SigLIP being used as the VE because the in-house encoder was not yet fully trained?
