arxiv:2404.06212

OmniFusion Technical Report

Published on Apr 9, 2024 · Submitted by AK on Apr 10, 2024

AI-generated summary

The OmniFusion model, leveraging pretrained LLMs and visual adapters, demonstrates superior performance across multiple visual-language benchmarks compared to open-source alternatives.

Abstract

Last year, multimodal architectures brought a revolution to AI-based approaches and solutions, extending the capabilities of large language models (LLMs). We propose OmniFusion, a model based on a pretrained LLM and adapters for the visual modality. We evaluated and compared several architecture design choices for better coupling of text and visual data: MLP and transformer adapters, various CLIP ViT-based encoders (SigLIP, InternViT, etc.) and their fusion approaches, image encoding methods (whole-image or tile encoding), and two 7B LLMs (a proprietary one and the open-source Mistral). Experiments on eight visual-language benchmarks (VizWiz, POPE, MM-Vet, ScienceQA, MMBench, TextVQA, VQAv2, MMMU) show the best OmniFusion setup achieving top scores across different VQA tasks compared with open-source LLaVA-like solutions. We also present a variety of situations where OmniFusion provides highly detailed answers across domains: housekeeping, sightseeing, culture, medicine, recognition of handwritten and scanned equations, etc. The Mistral-based OmniFusion model is an open-source solution with weights, training, and inference scripts available at https://github.com/AIRI-Institute/OmniFusion.
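
The abstract describes coupling a frozen CLIP-ViT-style visual encoder to a pretrained 7B LLM through lightweight adapters (MLP or transformer). The sketch below illustrates the MLP-adapter idea only: it is not the authors' implementation, and the dimensions (1024-d visual features, 4096-d LLM hidden size, 576 visual tokens) are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): project patch embeddings from a
# frozen visual encoder into the LLM embedding space with a small MLP, then
# prepend the resulting "visual tokens" to the embedded text prompt.
import torch
import torch.nn as nn

class MLPAdapter(nn.Module):
    """Two-layer MLP mapping visual features to the LLM hidden size."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_patches, vision_dim)
        return self.proj(visual_feats)  # (batch, num_patches, llm_dim)

# Toy usage with hypothetical shapes: 576 visual tokens + 32 text tokens.
adapter = MLPAdapter(vision_dim=1024, llm_dim=4096)
visual_feats = torch.randn(1, 576, 1024)   # e.g. ViT patch embeddings
text_embeds = torch.randn(1, 32, 4096)     # embedded prompt tokens
visual_tokens = adapter(visual_feats)
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)  # (1, 608, 4096)
```

For tile encoding, as mentioned in the abstract, each image crop would be passed through the encoder and adapter separately and the resulting token sequences concatenated before the text tokens.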

Community

Great job


Thank you!

Well done, and thanks for sharing this!


Thanks! If you have any questions, do not hesitate to ask.

@kuznetsoffandrey , there is a question on preliminary experiments at small scale that I believe were not covered in this paper. One way of arriving at a specific model architecture setup is to experiment with ablations and data mixtures on small-scale models: 600M/1B/3B (https://arxiv.org/pdf/2403.09611.pdf).
In the case of the OmniFusion architecture recipe, have you performed preliminary experiments with even smaller LLMs besides the one mentioned in the paper (Mistral-7B)?


This is a very good point. We have not tried it yet; however, we have a proprietary 3B model. I think we will share some experiments with smaller models soon.

Very awesome indeed!

OmniFusion: Revolutionizing Multimodal AI with Text and Image Integration

Links 🔗:

👉 Subscribe: https://www.youtube.com/@Arxflix
👉 Twitter: https://x.com/arxflix
👉 LMNT (Partner): https://lmnt.com/

By Arxflix


Models citing this paper 1

Datasets citing this paper 0


Spaces citing this paper 0


Collections including this paper 20