import streamlit as st
from streamlit_extras.switch_page_button import switch_page
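# UI strings for the page, keyed by language code ('en' / 'fr').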
translations = {
    'en': {
        'title': 'DocOwl 1.5',
        'original_tweet':
        """
[Original tweet](https://twitter.com/mervenoyann/status/1782421257591357824) (April 22, 2024)
""",
        'tweet_1':
        """
DocOwl 1.5 is the state-of-the-art document understanding model by Alibaba with an Apache 2.0 license 😍📝
Time to dive in and learn more 🧶
""",
        'tweet_2':
        """
This model consists of a ViT-based visual encoder that takes in crops of the image as well as the original image itself.
The encoder outputs then go through a convolution-based model, after which they are merged with the text and fed to the LLM.
""",
        'tweet_3':
        """
Initially, the authors only train the convolution-based part (called H-Reducer) and the vision encoder while keeping the LLM frozen.
Then, for fine-tuning (on image captioning, VQA, etc.), they freeze the vision encoder and train the H-Reducer and the LLM.
""",
        'tweet_4':
        """
They also use a simple linear projection on the text and the documents. You can see below how they model the text prompts and outputs 🤓
""",
        'tweet_5':
        """
They train the model on various downstream tasks, including:
- document understanding (DUE benchmark and more)
- table parsing (TURL, PubTabNet)
- chart parsing (PlotQA and more)
- image parsing (OCR-CC)
- text localization (DocVQA and more)
""",
        'tweet_6':
        """
They contribute a new model called DocOwl 1.5-Chat by:
1. creating a new document-chat dataset with questions from document VQA datasets
2. feeding them to ChatGPT to get long answers
3. fine-tuning the base model with it (which IMO works very well!)
""",
        'tweet_7':
        """
The resulting generalist model and the chat model are pretty much state-of-the-art 😍
Below you can see how they compare to fine-tuned models.
""",
        'tweet_8':
        """
All the models and the datasets (including some evaluation datasets for the above tasks!) are in this [organization](https://t.co/sJdTw1jWTR).
You can try the demo in this [Space](https://t.co/57E9DbNZXf).
""",
        'ressources':
        """
Resources:
[mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding](https://arxiv.org/abs/2403.12895)
by Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou (2024)
[GitHub](https://github.com/X-PLUG/mPLUG-DocOwl)
"""
    },
    'fr': {
        'title': 'DocOwl 1.5',
        'original_tweet':
        """
[Tweet de base](https://twitter.com/mervenoyann/status/1782421257591357824) (en anglais) (22 avril 2024)
""",
        'tweet_1':
        """
DocOwl 1.5 est le modèle de compréhension de documents à l'état de l'art d'Alibaba, sous licence Apache 2.0 😍📝
Il est temps de découvrir ce modèle 🧶
""",
        'tweet_2':
        """
Ce modèle se compose d'un encodeur visuel basé sur un ViT qui prend en entrée les crops de l'image et l'image originale elle-même.
Les sorties de l'encodeur passent ensuite par un modèle convolutif, après quoi elles sont fusionnées avec le texte, puis transmises au LLM.
""",
        'tweet_3':
        """
Au départ, les auteurs n'entraînent que la partie basée sur la convolution (appelée H-Reducer) et l'encodeur de vision, tout en gardant le LLM gelé.
Ensuite, pour le finetuning (légendage d'image, VQA, etc.), ils gèlent l'encodeur de vision et entraînent le H-Reducer et le LLM.
""",
        'tweet_4':
        """
Ils utilisent également une simple projection linéaire sur le texte et les documents. Vous pouvez voir ci-dessous comment ils modélisent les prompts et les sorties textuelles 🤓
""",
        'tweet_5':
        """
Ils entraînent le modèle sur diverses tâches en aval, notamment :
- la compréhension de documents (DUE benchmark et autres)
- l'analyse de tableaux (TURL, PubTabNet)
- l'analyse de graphiques (PlotQA et autres)
- l'analyse d'images (OCR-CC)
- la localisation de textes (DocVQA et autres)
""",
        'tweet_6':
        """
Ils proposent un nouveau modèle appelé DocOwl 1.5-Chat en :
1. créant un nouveau jeu de données document-chat avec des questions provenant de jeux de données de VQA de documents
2. les envoyant à ChatGPT pour obtenir des réponses longues
3. finetunant le modèle de base à l'aide de ce dernier (ce qui fonctionne très bien selon moi)
""",
        'tweet_7':
        """
Le modèle généraliste qui en résulte et le modèle de chat sont pratiquement à l'état de l'art 😍
Ci-dessous, vous pouvez voir comment ils se comparent aux modèles finetunés.
""",
        'tweet_8':
        """
Tous les modèles et jeux de données (y compris certains jeux de données d'évaluation sur les tâches susmentionnées !) se trouvent dans cette [organisation](https://t.co/sJdTw1jWTR).
Vous pouvez essayer la démo dans ce [Space](https://t.co/57E9DbNZXf).
""",
        'ressources':
        """
Ressources :
[mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding](https://arxiv.org/abs/2403.12895)
de Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou (2024)
[GitHub](https://github.com/X-PLUG/mPLUG-DocOwl)
"""
    }
}
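# Render a flag dropdown in the header and map the display choice to a locale key.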
def language_selector():
    languages = {'EN': '🇬🇧', 'FR': '🇫🇷'}
    selected_lang = st.selectbox('', options=list(languages.keys()), format_func=lambda x: languages[x], key='lang_selector')
    return 'en' if selected_lang == 'EN' else 'fr'
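# The selector is rendered before the title so that `lang` is already set
# when the localized strings below are looked up.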
left_column, right_column = st.columns([5, 1])

# Add a selector to the right column
with right_column:
    lang = language_selector()

# Add a title to the left column
with left_column:
    st.title(translations[lang]["title"])

st.success(translations[lang]["original_tweet"], icon="ℹ️")
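# Page body: the translated thread text, interleaved with the figure for each tweet.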
| st.markdown(""" """) | |
| st.markdown(translations[lang]["tweet_1"], unsafe_allow_html=True) | |
| st.markdown(""" """) | |
| st.image("pages/DocOwl_1.5/image_1.jpg", use_container_width=True) | |
| st.markdown(""" """) | |
| st.markdown(translations[lang]["tweet_2"], unsafe_allow_html=True) | |
| st.markdown(""" """) | |
| st.image("pages/DocOwl_1.5/image_2.jpeg", use_container_width=True) | |
| st.markdown(""" """) | |
| st.markdown(translations[lang]["tweet_3"], unsafe_allow_html=True) | |
| st.markdown(""" """) | |
| st.image("pages/DocOwl_1.5/image_3.jpeg", use_container_width=True) | |
| st.markdown(""" """) | |
| st.markdown(translations[lang]["tweet_4"], unsafe_allow_html=True) | |
| st.markdown(""" """) | |
| st.image("pages/DocOwl_1.5/image_4.jpeg", use_container_width=True) | |
| st.markdown(""" """) | |
| st.markdown(translations[lang]["tweet_5"], unsafe_allow_html=True) | |
| st.markdown(""" """) | |
| st.image("pages/DocOwl_1.5/image_5.jpeg", use_container_width=True) | |
| st.markdown(""" """) | |
| st.markdown(translations[lang]["tweet_6"], unsafe_allow_html=True) | |
| st.markdown(""" """) | |
| st.image("pages/DocOwl_1.5/image_6.jpeg", use_container_width=True) | |
| st.markdown(""" """) | |
| st.markdown(translations[lang]["tweet_7"], unsafe_allow_html=True) | |
| st.markdown(""" """) | |
| st.image("pages/DocOwl_1.5/image_7.jpeg", use_container_width=True) | |
| st.markdown(""" """) | |
| st.markdown(translations[lang]["tweet_8"], unsafe_allow_html=True) | |
| st.markdown(""" """) | |
| st.image("pages/DocOwl_1.5/image_8.jpeg", use_container_width=True) | |
| st.markdown(""" """) | |
| st.info(translations[lang]["ressources"], icon="📚") | |
| st.markdown(""" """) | |
| st.markdown(""" """) | |
| st.markdown(""" """) | |
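# Bottom navigation between papers, with localized button labels.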
col1, col2, col3 = st.columns(3)

with col1:
    if lang == "en":
        if st.button('Previous paper', use_container_width=True):
            switch_page("Grounding DINO")
    else:
        if st.button('Papier précédent', use_container_width=True):
            switch_page("Grounding DINO")
with col2:
    if lang == "en":
        if st.button("Home", use_container_width=True):
            switch_page("Home")
    else:
        if st.button("Accueil", use_container_width=True):
            switch_page("Home")
with col3:
    if lang == "en":
        if st.button("Next paper", use_container_width=True):
            switch_page("MiniGemini")
    else:
        if st.button("Papier suivant", use_container_width=True):
            switch_page("MiniGemini")