arxiv:2404.06212

OmniFusion Technical Report

Published on Apr 9, 2024 · Submitted by AK on Apr 10, 2024

AI-generated summary

The OmniFusion model, leveraging pretrained LLMs and visual adapters, demonstrates superior performance across multiple visual-language benchmarks compared to open-source alternatives.

Abstract

Last year, multimodal architectures brought a revolution to AI-based approaches and solutions, extending the capabilities of large language models (LLMs). We propose OmniFusion, a model based on a pretrained LLM and adapters for the visual modality. We evaluated and compared several architecture design choices for better coupling of text and visual data: MLP and transformer adapters, various CLIP ViT-based encoders (SigLIP, InternViT, etc.) and their fusion approaches, image encoding methods (whole-image or tile encoding), and two 7B LLMs (a proprietary one and the open-source Mistral). Experiments on eight visual-language benchmarks (VizWiz, POPE, MM-Vet, ScienceQA, MMBench, TextVQA, VQAv2, MMMU) show the best OmniFusion setup achieving top scores across different VQA tasks compared with open-source LLaVA-like solutions. We also present a variety of situations where OmniFusion provides highly detailed answers across domains: housekeeping, sightseeing, culture, medicine, recognition of handwritten and scanned equations, etc. The Mistral-based OmniFusion model is an open-source solution with weights, training, and inference scripts available at https://github.com/AIRI-Institute/OmniFusion.
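
The abstract describes coupling a frozen CLIP-ViT-style visual encoder to a pretrained 7B LLM through lightweight adapters (MLP or transformer). The sketch below illustrates the MLP-adapter idea only: it is not the authors' implementation, and the dimensions (1024-d visual features, 4096-d LLM hidden size, 576 visual tokens) are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): project patch embeddings from a
# frozen visual encoder into the LLM embedding space with a small MLP, then
# prepend the resulting "visual tokens" to the embedded text prompt.
import torch
import torch.nn as nn

class MLPAdapter(nn.Module):
    """Two-layer MLP mapping visual features to the LLM hidden size."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_patches, vision_dim)
        return self.proj(visual_feats)  # (batch, num_patches, llm_dim)

# Toy usage with hypothetical shapes: 576 visual tokens + 32 text tokens.
adapter = MLPAdapter(vision_dim=1024, llm_dim=4096)
visual_feats = torch.randn(1, 576, 1024)   # e.g. ViT patch embeddings
text_embeds = torch.randn(1, 32, 4096)     # embedded prompt tokens
visual_tokens = adapter(visual_feats)
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)  # (1, 608, 4096)
```

For tile encoding, as mentioned in the abstract, each image crop would be passed through the encoder and adapter separately and the resulting token sequences concatenated before the text tokens.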

Community

Great job


Thank you!

Well done, and thanks for sharing this!


Thanks! If you have any questions, do not hesitate to ask.

@kuznetsoffandrey , there is a question on preliminary experiments at small scale that I believe were not covered in this paper. One way of arriving at a specific model architecture setup is to experiment with ablations and data mixtures on small-scale models: 600M/1B/3B (https://arxiv.org/pdf/2403.09611.pdf).
In the case of the OmniFusion architecture recipe, have you performed preliminary experiments with even smaller LLMs besides the one mentioned in the paper (Mistral-7B)?


This is a very good point. We have not tried it yet; however, we have a proprietary 3B model. I think we will share some experiments with smaller models soon.

Very awesome indeed!

OmniFusion: Revolutionizing Multimodal AI with Text and Image Integration

Links 🔗:

👉 Subscribe: https://www.youtube.com/@Arxflix
👉 Twitter: https://x.com/arxflix
👉 LMNT (Partner): https://lmnt.com/

By Arxflix


Models citing this paper 1

Datasets citing this paper 0


Spaces citing this paper 0


Collections including this paper 20