import streamlit as st
from streamlit_extras.switch_page_button import switch_page
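
# Page copy in English and French, keyed by language code.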
translations = {
    'en': {
        'title': 'KOSMOS-2',
        'original_tweet':
            """
[Original tweet](https://x.com/mervenoyann/status/1720126908384366649) (November 2, 2023)
""",
        'tweet_1':
            """
The new 🤗 Transformers release includes a very powerful Multimodal Large Language Model (MLLM) by @Microsoft called KOSMOS-2! 🤩
The highlight of KOSMOS-2 is grounding: the model is *incredibly* accurate! 🌎
Play with the demo [here](https://huggingface.co/spaces/ydshieh/Kosmos-2) by [@ydshieh](https://x.com/ydshieh).
But how does this model work? Let's take a look! 👀🧶
""",
        'tweet_2':
            """
Grounding helps machine learning models relate to real-world examples. Grounding makes models more performant in terms of accuracy and robustness during inference. It also helps reduce the so-called "hallucinations" of language models.
""",
        'tweet_3':
            """
In KOSMOS-2, the model is grounded to perform the following tasks, and is evaluated on 👇
- multimodal grounding & phrase grounding, e.g. localizing an object through a natural language query
- multimodal referring, e.g. describing an object's characteristics & location
- perception-language tasks
- language understanding and generation
""",
        'tweet_4':
            """
The dataset used for grounding, called GRiT, is also available on the [Hugging Face Hub](https://huggingface.co/datasets/zzliang/GRIT).
Thanks to the 🤗 Transformers integration, you can use KOSMOS-2 with a few lines of code 🤩
See below! 👇
""",
        'ressources':
            """
Resources:
[Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824)
by Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei (2023)
[GitHub](https://github.com/microsoft/unilm/tree/master/kosmos-2)
[Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/kosmos-2)
""",
    },
    'fr': {
        'title': 'KOSMOS-2',
        'original_tweet':
            """
[Tweet de base](https://x.com/mervenoyann/status/1720126908384366649) (en anglais) (2 novembre 2023)
""",
        'tweet_1':
            """
La nouvelle version de 🤗 Transformers inclut un très puissant <i>Multimodal Large Language Model</i> (MLLM) de @Microsoft appelé KOSMOS-2 ! 🤩
Le point fort de KOSMOS-2 est l'ancrage : le modèle est *incroyablement* précis ! 🌎
Jouez avec la démo [ici](https://huggingface.co/spaces/ydshieh/Kosmos-2) de [@ydshieh](https://x.com/ydshieh).
Mais comment ce modèle fonctionne-t-il ? Jetons un coup d'œil ! 👀🧶
""",
        'tweet_2':
            """
L'ancrage permet aux modèles d'apprentissage automatique d'être liés à des exemples du monde réel. L'inclusion de l'ancrage rend les modèles plus performants en termes de précision et de robustesse lors de l'inférence. Cela permet également de réduire les « hallucinations » dans les modèles de langage.
""",
        'tweet_3':
            """
Dans KOSMOS-2, le modèle est ancré pour effectuer les tâches suivantes et est évalué sur 👇
- l'ancrage multimodal et l'ancrage de phrases, par exemple la localisation de l'objet par le biais d'une requête en langage naturel
- la référence multimodale, par exemple la description des caractéristiques et de l'emplacement de l'objet
- les tâches de perception-langage
- la compréhension et la génération du langage
""",
        'tweet_4':
            """
Le jeu de données utilisé pour l'ancrage, appelé GRiT, est également disponible sur le [Hub d'Hugging Face](https://huggingface.co/datasets/zzliang/GRIT).
Grâce à l'intégration dans 🤗 Transformers, vous pouvez utiliser KOSMOS-2 avec quelques lignes de code 🤩
Voir ci-dessous ! 👇
""",
        'ressources':
            """
Ressources :
[Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824)
de Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei (2023)
[GitHub](https://github.com/microsoft/unilm/tree/master/kosmos-2)
[Documentation d'Hugging Face](https://huggingface.co/docs/transformers/model_doc/kosmos-2)
""",
    },
}
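
# Render a small EN/FR selector and return the language code used to index `translations`.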
def language_selector():
    languages = {'EN': '🇬🇧', 'FR': '🇫🇷'}
    selected_lang = st.selectbox('', options=list(languages.keys()), format_func=lambda x: languages[x], key='lang_selector')
    return 'en' if selected_lang == 'EN' else 'fr'
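
# Header layout: page title on the left, language selector on the right.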
left_column, right_column = st.columns([5, 1])

# Add a selector to the right column
with right_column:
    lang = language_selector()

# Add a title to the left column
with left_column:
    st.title(translations[lang]["title"])

st.success(translations[lang]["original_tweet"], icon="ℹ️")
st.markdown(""" """)

st.markdown(translations[lang]["tweet_1"], unsafe_allow_html=True)
st.markdown(""" """)

st.video("pages/KOSMOS-2/video_1.mp4", format="video/mp4")
st.markdown(""" """)

st.markdown(translations[lang]["tweet_2"], unsafe_allow_html=True)
st.markdown(""" """)

st.markdown(translations[lang]["tweet_3"], unsafe_allow_html=True)
st.markdown(""" """)

st.markdown(translations[lang]["tweet_4"], unsafe_allow_html=True)
st.markdown(""" """)

st.image("pages/KOSMOS-2/image_1.jpg", use_container_width=True)
st.markdown(""" """)
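
# The expander shows the reader a minimal 🤗 Transformers usage snippet; it assumes a CUDA device
# and a user-supplied image path and prompt.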
with st.expander("Code"):
    if lang == "en":
        st.code("""
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image

model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224").to("cuda")
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")

image_input = Image.open(user_image_path)

# optionally prepend different preprompts to describe images
brief_preprompt = "<grounding>An image of"
detailed_preprompt = "<grounding>Describe this image in detail:"
text_input = brief_preprompt  # or detailed_preprompt

inputs = processor(text=text_input, images=image_input, return_tensors="pt").to("cuda")

generated_ids = model.generate(
    pixel_values=inputs["pixel_values"],
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    image_embeds=None,
    image_embeds_position_mask=inputs["image_embeds_position_mask"],
    use_cache=True,
    max_new_tokens=128,
)

generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
processed_text, entities = processor.post_process_generation(generated_text)

# check out the Space for inference with bounding box drawing
""")
    else:
        st.code("""
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image

model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224").to("cuda")
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")

image_input = Image.open(user_image_path)

# ajouter éventuellement différents préprompts pour décrire les images
brief_preprompt = "<grounding>An image of"
detailed_preprompt = "<grounding>Describe this image in detail:"
text_input = brief_preprompt  # ou detailed_preprompt

inputs = processor(text=text_input, images=image_input, return_tensors="pt").to("cuda")

generated_ids = model.generate(
    pixel_values=inputs["pixel_values"],
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    image_embeds=None,
    image_embeds_position_mask=inputs["image_embeds_position_mask"],
    use_cache=True,
    max_new_tokens=128,
)

generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
processed_text, entities = processor.post_process_generation(generated_text)

# consultez le Space pour l'inférence avec le tracé des bounding boxes
""")
st.markdown(""" """)
st.info(translations[lang]["ressources"], icon="📚")
st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
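
# Previous / Home / Next navigation between the paper pages.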
col1, col2, col3 = st.columns(3)
with col1:
    if lang == "en":
        if st.button('Previous paper', use_container_width=True):
            switch_page("Home")
    else:
        if st.button('Papier précédent', use_container_width=True):
            switch_page("Home")
with col2:
    if lang == "en":
        if st.button("Home", use_container_width=True):
            switch_page("Home")
    else:
        if st.button("Accueil", use_container_width=True):
            switch_page("Home")
with col3:
    if lang == "en":
        if st.button("Next paper", use_container_width=True):
            switch_page("MobileSAM")
    else:
        if st.button("Papier suivant", use_container_width=True):
            switch_page("MobileSAM")