import streamlit as st
from streamlit_extras.switch_page_button import switch_page

st.title("DocOwl 1.5")
st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1782421257591357824) (April 22, 2024)""", icon="ℹ️")
st.markdown(""" """)
| st.markdown("""DocOwl 1.5 is the state-of-the-art document understanding model by Alibaba with Apache 2.0 license ππ | |
| Time to dive in and learn more π§Ά | |
| """) | |
| st.markdown(""" """) | |
| st.image("pages/DocOwl_1.5/image_1.jpg", use_column_width=True) | |
| st.markdown(""" """) | |
| st.markdown("""This model consists of a ViT-based visual encoder part that takes in crops of image and the original image itself. | |
| Then the outputs of the encoder goes through a convolution based model, after that the outputs are merged with text and then fed to LLM. | |
| """) | |
st.markdown(""" """)
st.image("pages/DocOwl_1.5/image_2.jpeg", use_column_width=True)
st.markdown(""" """)
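st.markdown("""
Below is a minimal PyTorch-style sketch of that pipeline. The module names, dimensions and the 1×4 horizontal merge ratio are assumptions for illustration, not the official implementation.
""")
st.code('''
import torch
import torch.nn as nn

class HReducer(nn.Module):
    # Convolution that merges horizontally neighbouring patches,
    # shrinking the visual token sequence before it reaches the LLM.
    def __init__(self, vit_dim=1024, llm_dim=4096, ratio=4):
        super().__init__()
        self.conv = nn.Conv2d(vit_dim, vit_dim, kernel_size=(1, ratio), stride=(1, ratio))
        self.proj = nn.Linear(vit_dim, llm_dim)

    def forward(self, patch_grid):            # (B, vit_dim, H, W)
        reduced = self.conv(patch_grid)       # (B, vit_dim, H, W // ratio)
        tokens = reduced.flatten(2).transpose(1, 2)
        return self.proj(tokens)              # (B, H * W // ratio, llm_dim)

def forward_pass(crops, global_image, text_embeds, vit, h_reducer, llm):
    # Encode each crop plus the resized full page with the shared ViT encoder.
    feats = [vit(img) for img in (*crops, global_image)]
    # Compress every patch grid with the H-Reducer and concatenate the tokens.
    visual_tokens = torch.cat([h_reducer(f) for f in feats], dim=1)
    # Merge visual tokens with the text embeddings and let the LLM decode.
    return llm(inputs_embeds=torch.cat([visual_tokens, text_embeds], dim=1))
''', language="python")
st.markdown(""" """)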
| st.markdown(""" | |
| Initially, the authors only train the convolution based part (called H-Reducer) and vision encoder while keeping LLM frozen. | |
| Then for fine-tuning (on image captioning, VQA etc), they freeze vision encoder and train H-Reducer and LLM. | |
| """) | |
st.markdown(""" """)
st.image("pages/DocOwl_1.5/image_3.jpeg", use_column_width=True)
st.markdown(""" """)
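st.markdown("""
Here is a rough sketch of how such stage-wise freezing could be set up in PyTorch. The attribute names (`vision_encoder`, `h_reducer`, `llm`) are made up for illustration and do not come from the official codebase.
""")
st.code('''
def set_trainable(model, stage):
    # Stage 1 (pre-training): train ViT + H-Reducer, freeze the LLM.
    # Stage 2 (fine-tuning): freeze ViT, train H-Reducer + LLM.
    for p in model.vision_encoder.parameters():
        p.requires_grad = stage == 1
    for p in model.h_reducer.parameters():
        p.requires_grad = True              # trained in both stages
    for p in model.llm.parameters():
        p.requires_grad = stage == 2
''', language="python")
st.markdown(""" """)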
| st.markdown("""Also they use simple linear projection on text and documents. You can see below how they model the text prompts and outputs π€ | |
| """) | |
st.markdown(""" """)
st.image("pages/DocOwl_1.5/image_4.jpeg", use_column_width=True)
st.markdown(""" """)
| st.markdown("""They train the model various downstream tasks including: | |
| - document understanding (DUE benchmark and more) | |
| - table parsing (TURL, PubTabNet) | |
| - chart parsing (PlotQA and more) | |
| - image parsing (OCR-CC) | |
| - text localization (DocVQA and more) | |
| """) | |
st.markdown(""" """)
st.image("pages/DocOwl_1.5/image_5.jpeg", use_column_width=True)
st.markdown(""" """)
| st.markdown(""" | |
| They contribute a new model called DocOwl 1.5-Chat by: | |
| 1. creating a new document-chat dataset with questions from document VQA datasets | |
| 2. feeding them to ChatGPT to get long answers | |
| 3. fine-tune the base model with it (which IMO works very well!) | |
| """) | |
st.markdown(""" """)
st.image("pages/DocOwl_1.5/image_6.jpeg", use_column_width=True)
st.markdown(""" """)
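st.markdown("""
A simplified sketch of step 2 with the OpenAI client is shown below; the model name and the prompt are placeholders, not the authors' exact setup.
""")
st.code('''
from openai import OpenAI

client = OpenAI()

def detailed_answer(question, short_answer):
    # Turn a terse document-VQA answer into a long, explained answer.
    prompt = (
        f"Question about a document: {question} "
        f"Short answer: {short_answer} "
        "Rewrite the answer as a detailed, well-explained response."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
''', language="python")
st.markdown(""" """)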
| st.markdown(""" | |
| Resulting generalist model and the chat model are pretty much state-of-the-art π | |
| Below you can see how it compares to fine-tuned models. | |
| """) | |
st.markdown(""" """)
st.image("pages/DocOwl_1.5/image_7.jpeg", use_column_width=True)
st.markdown(""" """)
| st.markdown("""All the models and the datasets (also some eval datasets on above tasks!) are in this [organization](https://t.co/sJdTw1jWTR). | |
| The [Space](https://t.co/57E9DbNZXf). | |
| """) | |
st.markdown(""" """)
st.image("pages/DocOwl_1.5/image_8.jpeg", use_column_width=True)
st.markdown(""" """)
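st.markdown("""
If you want to pull the weights locally, something like the snippet below should work; the repository id is an assumption, so check the organization page for the exact name.
""")
st.code('''
from huggingface_hub import snapshot_download

# Download the DocOwl 1.5 checkpoint (repo id assumed, verify it on the Hub).
local_dir = snapshot_download("mPLUG/DocOwl1.5")
print(local_dir)
''', language="python")
st.markdown(""" """)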
| st.info(""" | |
| Ressources: | |
| [mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding](https://arxiv.org/abs/2403.12895) | |
| by Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou (2024) | |
| [GitHub](https://github.com/X-PLUG/mPLUG-DocOwl)""", icon="π") | |
st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
col1, col2, col3 = st.columns(3)
with col1:
    if st.button('Previous paper', use_container_width=True):
        switch_page("Grounding DINO")
with col2:
    if st.button('Home', use_container_width=True):
        switch_page("Home")
with col3:
    if st.button('Next paper', use_container_width=True):
        switch_page("PLLaVA")